
Text Technology

Summer 2024
Johannes Dellert

Exercise Sheet 06:


NLP Tools on the Command Line,
Challenges of Using NLP Tools (Solution)
handed out: 9 June, 18:00
to be submitted by: 17 June, 06:00

IMPORTANT: For all tasks that require usage of the command line: either copy and paste the command
you used and the output you were shown, or include a screenshot of the command line showing the
command and the output.

HINT FOR WINDOWS USERS: If you cannot get OpenNLP to run in your Linux subsystem, you can
instead run it natively on the Windows command line (run cmd) using the .bat file provided. Pipes and
redirection are available, but not the usual range of Unix tools (like grep, sed, wc). Our recommendation
is to run the Linux subsystem and Cmd in parallel, to run OpenNLP within Cmd, and then to redirect
the output into a file (using >), which you can then read in (using <) for further processing by the Unix tools.

Task 1: Tokeniser
In this task, we will return to the OpenSubtitles parallel corpora. You are provided with a file to work with
(grail_corpus.txt), but if your parallel corpus from Exercise Sheet 03 has a language for which models
compatible with OpenNLP are available to download, feel free to use your data. Your goal in this task will
be to compare a naive token count on the raw source data with a token count based on the tokenised data.

For each of the three languages, do the following:


1. Build a command line pipeline to select lines in the given language from the corpus and do a naive
token count for each line, i.e. using spaces as delimiters and using the raw data. Save the output to
a file, so that you can later use it for comparison.
2. Introduce the OpenNLP tokeniser into your pipeline, i.e. tokenise the lines and then perform the
count. Again, save the output to a file.
3. Using the data in the two saved files, compute the average difference in word count between the naive
and tokenised approaches using the command line.
Compare the raw data with the output from the tokeniser: what could explain the differences
you found? How does this differ across the languages in the corpus?

Hint: To use OpenNLP with languages other than English, you need to download additional models.
For a selection of languages, you can find official models at https://opennlp.apache.org/models.html
and https://opennlp.sourceforge.net/models-1.5/. The provided file contains English, German, and
Swedish.
Hint: This time, our full solution comprises more than a single line, and it combines the following tools:
awk, nl, OpenNLP, join.

Example solutions:

1. Using awk, we can "filter" out the individual languages via the line number modulo 4: remainder 1
gives us the English lines, remainder 2 the German lines, and remainder 3 the Swedish lines.
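A minimal sketch of step 1, assuming the corpus consists of four-line blocks (English, German, Swedish, separator); the tiny corpus created here is a stand-in for the real grail_corpus.txt, and the output filename is an assumption:

```shell
# Stand-in for grail_corpus.txt: four-line blocks (EN, DE, SV, blank),
# matching the layout assumed by the modulo-4 filter.
printf '%s\n' "Hello there ." "Hallo du ." "Hej du ." "" \
              "It is a fish ." "Es ist ein Fisch ." "Det är en fisk ." "" > grail_corpus.txt

# Step 1: select the English lines (line number mod 4 == 1) and do a
# naive token count (NF counts whitespace-delimited fields); nl numbers
# the lines so the counts can later be joined with the tokenised counts.
awk 'NR % 4 == 1' grail_corpus.txt | awk '{print NF}' | nl > en_naive_counts.txt
cat en_naive_counts.txt
```

The same pipeline with `NR % 4 == 2` or `== 3` covers German and Swedish.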

2.
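For step 2, the OpenNLP call is shown as a comment, since it requires the toolkit and the en-token.bin model to be installed; the counting step then runs on a stand-in of the tokeniser's output:

```shell
# With OpenNLP available, the tokenised counts would come from:
#   awk 'NR % 4 == 1' grail_corpus.txt \
#     | bin/opennlp TokenizerME models/en-token.bin \
#     | awk '{print NF}' | nl > en_tok_counts.txt

# Stand-in for the tokeniser output (punctuation split off as tokens):
printf '%s\n' "Hello there ." "It is a fish ." > tokenised_en.txt
awk '{print NF}' tokenised_en.txt | nl > en_tok_counts.txt
cat en_tok_counts.txt
```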

3.
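Step 3 can be sketched with join and awk; the two count files created here are stand-ins in nl's "line-number, count" format, and the filenames are assumptions:

```shell
# Stand-in count files as nl would produce them (line number, then count).
printf '1\t3\n2\t5\n' > en_naive_counts.txt
printf '1\t4\n2\t7\n' > en_tok_counts.txt

# Join the two files on the line number, take the per-line difference
# (tokenised minus naive), and average it over all lines.
join en_naive_counts.txt en_tok_counts.txt \
  | awk '{sum += $3 - $2} END {print "average difference:", sum / NR}'
```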

The most visible difference is that OpenNLP separates punctuation marks into their own tokens. The effect
this has can be seen, for example, in the first sentence of the corpus. We see that English and German
are very similar, and, presumably, the difference can be explained by punctuation marks: there is always
at least one at the end of each sentence, and some also appear within sentences. Although Swedish
deviates more, the difference is not large. The difference is probably not particularly informative, since all
three languages are quite similar and presumably use punctuation marks in similar ways.

Task 2: POS Tagger


Using the OpenNLP toolkit and the Unix command line, annotate the provided rawPOS.txt file and save the
output into tag.txt. You must write all commands as a single line. The saved output should be
tab-separated, like this:

word tag
Then compare tag.txt with the provided hand-labelled en_pud-ud-text.conllu using the diff command.
Save the diff output to a file compare.txt. This comparison can be done as a separate command. Submit
both tag.txt and compare.txt.

Solution:
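A sketch of one possible solution. The OpenNLP invocation is shown as a comment (it requires the toolkit and the en-pos-maxent.bin model; the model name is an assumption), and the word/tag reformatting step runs on a stand-in line of tagger output in OpenNLP's "word_TAG" format:

```shell
# With OpenNLP installed, a single-line command could look like:
#   bin/opennlp POSTagger models/en-pos-maxent.bin < rawPOS.txt \
#     | tr ' ' '\n' \
#     | awk '{k=match($0,/_[^_]*$/); print substr($0,1,k-1) "\t" substr($0,k+1)}' \
#     > tag.txt
#   diff tag.txt en_pud-ud-text.conllu > compare.txt

# The reformatting step on a stand-in line of tagger output: put one
# "word_TAG" pair per line, then split each pair at its LAST underscore
# (so words containing underscores survive) into word<TAB>tag.
echo 'The_DT cat_NN sleeps_VBZ ._.' | tr ' ' '\n' \
  | awk '{k=match($0,/_[^_]*$/); print substr($0,1,k-1) "\t" substr($0,k+1)}' > tag.txt
cat tag.txt
```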

Task 3: Named Entity Recognition


bin/opennlp SentenceDetector models/en-sent.bin < input/bheki.txt \
    | bin/opennlp TokenizerME models/en-token.bin \
    | bin/opennlp TokenNameFinder models/en-ner-person.bin

and

bin/opennlp SentenceDetector models/en-sent.bin < input/bheki.txt \
    | bin/opennlp TokenizerME models/en-token.bin \
    | bin/opennlp TokenNameFinder models/en-ner-organization.bin

or all together:

bin/opennlp SentenceDetector models/en-sent.bin < input/bheki.txt \
    | bin/opennlp TokenizerME models/en-token.bin \
    | bin/opennlp TokenNameFinder models/en-ner-person.bin \
    | bin/opennlp TokenNameFinder models/en-ner-organization.bin

Problem: foreign or uncommon names that the model was never exposed to during training cause mistakes
in the NER output.

Task 4: Parser I
Using the OpenNLP toolkit, parse the provided rawParse.txt file to obtain the parse tree. Then, based on the
parse tree and using the Unix command line, identify the position of each noun and verb within the tree.
Based on the output, analyse the characteristics of their placement.
Solution:
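A sketch under the assumption that the parser's bracketed output is saved in a file parsed.txt; the OpenNLP call is shown as a comment (it requires the toolkit and a parser model), and a small awk scanner then reports the bracket depth of every noun (NN*) and verb (VB*) tag:

```shell
# With OpenNLP installed, the parse would come from something like:
#   bin/opennlp Parser models/en-parser-chunking.bin < rawParse.txt > parsed.txt

# Stand-in parse tree (one bracketed sentence per line):
echo '(TOP (S (NP (DT The) (NN cat)) (VP (VBZ sleeps))))' > parsed.txt

# Walk each line character by character, tracking bracket depth, and
# print the depth at which every NN*/VB* tag occurs.
awk '{
  depth = 0; label = ""
  for (i = 1; i <= length($0); i++) {
    c = substr($0, i, 1)
    if (c == "(")      { depth++; label = "" }
    else if (c == ")") { depth--; label = "" }
    else if (c == " ") {
      if (label ~ /^(NN|VB)/) print label, depth
      label = ""
    }
    else label = label c
  }
}' parsed.txt > noun_verb_depths.txt
cat noun_verb_depths.txt
```

On real parser output, comparing the printed depths shows how deep nouns sit relative to verbs.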

By examining the result, you can see that the verbs occupy shallower positions in the tree than the nouns.

Task 5: Analysing Parses II


In this task, you will analyse noun phrases across a large amount of data to see if you can find general
linguistic patterns. You will work with the English ELRC-837-Legal corpus (provided as elrc_legal_en.txt).
Create an OpenNLP command line pipeline that detects sentences, then tokenises and parses them.
Since this could take a while, it may be a good idea to save the output into a file and then use that as
further input. Use the resulting parses in a new command which finds noun phrases and extracts the
leftmost POS tag in each noun phrase (if you encounter nested noun phrases, take the deepest tag that is not
an NP). For example:
• From “(NP (DT those)”, we want to extract DT.
• From “(NP (NP (JJ new) (NNS challenges)) (, ,) (PP (VBG including) (NP (VBG rising)
(NN inequality) (CC and) (NN worker) (NN vulnerability))))”, we want to extract JJ and
VBG.
Extend the pipeline to create a list of matched POS tags together with their frequencies. Which adnominal
is the most common?

Hint: Our full solution comprises more than a single line, and it combines the following tools: OpenNLP,
grep, awk, sort, uniq.

Example Solution:
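A sketch of one way to do this. The OpenNLP pipeline is shown as a comment (it requires the toolkit and the English sentence, token, and parser models; the filenames are assumptions), and the extraction step runs on stand-in parser output: the grep pattern matches "(NP (" followed by any further nested "(NP (", so the tag captured is the deepest leftmost non-NP tag, as required.

```shell
# With OpenNLP installed, the parses would come from something like:
#   bin/opennlp SentenceDetector models/en-sent.bin < elrc_legal_en.txt \
#     | bin/opennlp TokenizerME models/en-token.bin \
#     | bin/opennlp Parser models/en-parser-chunking.bin > parses.txt

# Stand-in parser output (one bracketed tree per line):
printf '%s\n' \
  '(TOP (S (NP (DT those)) (VP (VBD failed))))' \
  '(TOP (S (NP (NP (JJ new) (NNS challenges)) (PP (IN of) (NP (NN work))))))' \
  > parses.txt

# Leftmost tag of each NP, descending through nested NPs, with counts:
grep -o '(NP (\(NP (\)*[A-Z.,:$]*' parses.txt \
  | awk '{sub(/^\(/, "", $NF); print $NF}' \
  | sort | uniq -c | sort -rn
```

The first line of the sorted output answers which adnominal is the most common.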

Task 6: Challenges
You do not need to use any command line tools for this task.
Imagine you were to use an NLP tool like OpenNLP to perform NLP tasks on non-standard texts, as listed
below. For each scenario, state what problems you would expect to occur and why.

1. Using a tokeniser on a URL such as https://www.reddit.com/r/EthicalLifeProTips/comments/9fua2c6/elpt_request_how_to_get_to_university_by_bus/
2. Using a sentence detector on lyrics such as
[Verse 1]
Is it getting better?
Or do you feel the same?
Will it make it easier on you
Now you got someone to blame?

[Chorus]
You say one love, one life
When it's one need in the night
One love, we get to share it
[...]

(taken from https://genius.com/U2-one-lyrics)
3. Using a parser on a recipe such as
- 1 tablespoon olive oil
- 1 garlic clove, finely chopped
- 8 thin-stemmed asparagus stalks, trimmed
[...]

Directions
Step 1
Heat a small skillet over medium-high heat. Add olive oil and garlic; cook and stir until
garlic is fragrant, about 30 seconds.

Step 2
Add asparagus
[...]

(taken from https://www.allrecipes.com/asparagus-and-eggs-recipe-8634304)

Example solution:
1. Incorrect tokenisation due to, e.g., slashes, full stops, and the lack of spaces; a URL is not structured
like the natural language text that tokenisers are usually trained on.
2. Incorrect sentence detection due to, e.g., the lack of full stops and, in this case, the special bracket
notation; the lyrics may not contain proper sentences in the first place, and lyrics in general are not
structured like the natural language text that sentence detectors are usually trained on.
3. Incorrect parsing due to, e.g., bullet-point ingredient lists, headings, and non-standard punctuation;
the instructions may have non-standard sentence structure, and recipes in general are not structured
like the natural language text that parsers are usually trained on.

