LAB02
What is NLTK?
Natural Language Toolkit (NLTK) is a leading platform for building Python programs to work with human
language data (Natural Language Processing). It is accompanied by a book that explains the underlying
concepts behind the language processing tasks supported by the toolkit. NLTK is intended to support
research and teaching in NLP or closely related areas, including empirical linguistics, cognitive science,
artificial intelligence, information retrieval, and machine learning.
http://www.nltk.org/install.html
http://www.nltk.org/data.html
http://www.tutorialspoint.com/python/python tutorial.pdf
Python overview
Basic syntax
Identifiers
A Python identifier is a name used to identify a variable, function, class, module, or other object. An
identifier starts with a letter (A to Z or a to z) or an underscore (_), followed by zero or more letters,
underscores, and digits (0 to 9). Python does not allow punctuation characters such as @, $, and % within
identifiers. Python is a case-sensitive programming language; thus, Variable and variable are two
different identifiers in Python.
Python provides no braces to indicate blocks of code for class and function definitions or flow control.
Blocks of code are denoted by line indentation, which is rigidly enforced. The number of spaces in the
indentation is variable, but all statements within the block must be indented the same amount.
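For example, a minimal illustration where indentation marks the block:
x = 5
if x > 0:
    print("positive")     # these two indented statements
    print("done")         # form one block
else:
    print("non-positive")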
Quotation
Python accepts single ('), double (") and triple (''' or """) quotes to denote string literals, as long as
the same type of quote starts and ends the string.
Examples:
word = 'word'
sentence = "This is a sentence."
paragraph = """This is a paragraph. It is made up of multiple lines and
sentences."""
Standard data types
Python has five standard data types:
• numbers;
• strings;
• lists;
• tuples;
• dictionaries.
Python variables do not need explicit declaration to reserve memory space. The declaration happens
automatically when you assign a value to a variable. The equal sign (=) is used to assign values to
variables. The operand to the left of the = operator is the name of the variable and the operand to the
right of the = operator is the value stored in the variable.
For example:
counter = 100 # An integer assignment
miles = 1000.0 # A floating point
name = "John" # A string
Lists
print(len([1, 2, 3]))          # 3 - length
print([1, 2, 3] + [4, 5, 6])   # [1, 2, 3, 4, 5, 6] - concatenation
print(['Hi!'] * 4)             # ['Hi!', 'Hi!', 'Hi!', 'Hi!'] - repetition
print(3 in [1, 2, 3])          # True - membership
for x in [1, 2, 3]:
    print(x)                   # 1 2 3 - iteration
Some built-in functions useful when working with lists are max, min, len, and list (which converts a
tuple to a list); cmp is available only in Python 2. Some of the list-specific methods are list.append,
list.extend, list.count, etc.
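A short illustrative sketch of these on a sample list:
numbers = [3, 1, 4, 1, 5]
print(len(numbers))      # 5
print(max(numbers))      # 5
print(min(numbers))      # 1
numbers.append(9)        # [3, 1, 4, 1, 5, 9]
numbers.extend([2, 6])   # [3, 1, 4, 1, 5, 9, 2, 6]
print(numbers.count(1))  # 2
print(list((7, 8)))      # [7, 8] - tuple converted to list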
Tuples
Basic tuple operations are the same as with lists: length, concatenation, repetition, membership, and
iteration. The main difference is that tuples are immutable.
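A short illustration (note that, unlike lists, tuples cannot be modified after creation):
a_tuple = (1, 2, 3)
print(len(a_tuple))       # 3 - length
print(a_tuple + (4, 5))   # (1, 2, 3, 4, 5) - concatenation
print(2 in a_tuple)       # True - membership
for x in a_tuple:
    print(x)              # 1 2 3 - iteration
# a_tuple[0] = 9          # would raise TypeError: tuples are immutable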
Dictionaries
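A dictionary stores key/value pairs and is indexed by keys rather than by positions. A minimal illustration:
capitals = {'Cyprus': 'Nicosia', 'Greece': 'Athens'}
print(capitals['Cyprus'])        # Nicosia - lookup by key
capitals['Italy'] = 'Rome'       # add a new key/value pair
print(len(capitals))             # 3
print('Greece' in capitals)      # True - membership tests keys
for country, city in capitals.items():
    print(country, city)         # iterate over key/value pairs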
List comprehension
Comprehensions are constructs that allow sequences to be built from other sequences. Python 2.0
introduced list comprehensions, and Python 3.0 added dictionary and set comprehensions. The
following is an example:
a_list = [1, 2, 9, 3, 0, 4]
squared_ints = [e**2 for e in a_list]
print(squared_ints) # [ 1, 4, 81, 9, 0, 16 ]
a_list = [1, 2, 9, 3, 0, 4]
squared_ints = []
for e in a_list:
    squared_ints.append(e**2)
print(squared_ints) # [ 1, 4, 81, 9, 0, 16 ]
Now, let’s see an example with an if statement. The example below shows how to filter out non-integer
types from a mixed list and apply operations.
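A minimal sketch, using a made-up mixed list:
a_list = [1, '4', 9, 'a', 0, 4]
squared_ints = [e**2 for e in a_list if type(e) == int]  # keep and square only the integers
print(squared_ints)  # [1, 81, 0, 16]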
However, if you want to include an if-else statement, the arrangement looks a bit different, as shown below.
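A minimal sketch of the if-else form: the condition moves to the front of the comprehension, and here
non-integers are kept unchanged instead of being dropped.
a_list = [1, '4', 9, 'a', 0, 4]
result = [e**2 if type(e) == int else e for e in a_list]
print(result)  # [1, '4', 81, 'a', 0, 16]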
String handling
Other useful string methods include join, split, count, capitalize, strip, upper, lower, etc.
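A short illustration of some of these methods:
s = "  Hello world  "
print(s.strip())              # Hello world - whitespace removed from both ends
print(s.upper())              #   HELLO WORLD
print(s.count('l'))           # 3
print(s.split())              # ['Hello', 'world']
print('-'.join(s.split()))    # Hello-world
print("python".capitalize())  # Python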
IO handling
Python 2 had two built-in functions for reading from standard input: raw_input and input. In Python 3
only input() remains, and it always returns a string.
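A minimal Python 3 sketch of reading from standard input:
name = input("Enter your name: ")     # input() returns a string
print("Hello,", name)
age = int(input("Enter your age: "))  # convert explicitly when a number is needed
print(age + 1)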
File opening
A useful module for handling TSV and CSV files is the csv library.
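A minimal sketch of reading a tab-separated file with the csv module (the filename data.tsv is a made-up example):
import csv

with open('data.tsv', newline='') as f:      # 'data.tsv' is a hypothetical file
    reader = csv.reader(f, delimiter='\t')   # use delimiter=',' for comma-separated files
    for row in reader:
        print(row)                           # each row is a list of column values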
Functions
def functionname(parameters):
    "function_docstring"
    function_suite
    return [expression]
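For example, a small function following this pattern:
def greet(name, greeting="Hello"):
    """Return a greeting for the given name."""
    return greeting + ", " + name + "!"

print(greet("John"))        # Hello, John!
print(greet("Mary", "Hi"))  # Hi, Mary!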
The given VM has an Anaconda environment installed along with Python 3.8 and provides the Spyder IDE
for writing and running Python source code.
Create a file called list_merge.py. Type the following code and fill in the missing parts (< ... >). Create a
dictionary result, where the keys are the values from a given list and the values are taken from a given
tuple. Use a list comprehension or a standard loop.
Submission: Add the code where indicated and submit list_merge.py to Moodle.
Word tokenization: A sentence or text can be split into words using the method word_tokenize():
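A minimal sketch that produces the token list shown below (the example sentence is reconstructed from
that output; NLTK and its punkt tokenizer data are assumed to be installed):
from nltk.tokenize import word_tokenize

data = "All work and no play makes jack dull boy, all work and no play"
print(word_tokenize(data))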
['All', 'work', 'and', 'no', 'play', 'makes', 'jack', 'dull', 'boy', ',', 'all', 'work', 'and', 'no', 'play']
All of them are words except the comma; punctuation marks are treated as separate tokens.
Sentence tokenization: The same principle can be applied to sentences; simply change word_tokenize()
to sent_tokenize(). We have added two sentences to the variable data:
data = "All work and no play makes jack dull boy. All work and no play makes jack a dull boy."
print(sent_tokenize(data))
Outputs:
['All work and no play makes jack dull boy.', 'All work and no play makes jack a dull boy.']
If you wish, you can store the words and sentences in lists:
from nltk.tokenize import sent_tokenize, word_tokenize
data = "All work and no play makes jack dull boy. All work and no play makes jack a dull boy."
phrases = sent_tokenize(data)
words = word_tokenize(data)
print(phrases)
print(words)
English text may contain stop words like ‘the’, ‘is’, ‘are’. Stop words can be filtered out of the text to be
processed. There is no universal list of stop words in NLP research; however, the NLTK module contains
such a list. Now you will learn how to remove stop words using NLTK. We start with the code from the
previous section with tokenized words.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

data = "All work and no play makes jack dull boy. All work and no play makes jack a dull boy."
stopWords = set(stopwords.words('english'))  # a set of English stopwords
words = word_tokenize(data.lower())
wordsFiltered = []
for w in words:
    if w not in stopWords:
        wordsFiltered.append(w)
<your code> # Print the number of stopwords
<your code> # Print the stopwords
<your code> # Print the filtered text: ['work', 'play', 'makes', 'jack', 'dull', 'boy', '.', 'work', 'play', 'makes', 'jack', 'dull', 'boy', '.']
Submission: Create a file named stop_word_removal.py with the previous code snippet and submit
it to Moodle.
Step 4. Stemming
A word stem is the base part of a word. Stemming is a kind of normalization, but a linguistic one: for
example, the stem of the word waiting is wait. Given words, NLTK can find the stems. Start by defining some words:
words = ["game","gaming","gamed","games"]
We import the module:
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
There are more stemming algorithms, but Porter (PorterStemmer) is the most popular.
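A minimal sketch that stems the words defined above with PorterStemmer:
ps = PorterStemmer()
for word in words:
    print(ps.stem(word))  # game game game game - all four words share the stem 'game'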
Step 5. n-grams
Word n-grams
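A minimal sketch using nltk.util.ngrams to build word bigrams from a tokenized sentence (the sentence
reuses the earlier example):
from nltk.util import ngrams
from nltk.tokenize import word_tokenize

data = "All work and no play makes jack a dull boy."
tokens = word_tokenize(data)
bigrams = list(ngrams(tokens, 2))  # n=2 gives word bigrams
print(bigrams[:3])                 # [('All', 'work'), ('work', 'and'), ('and', 'no')]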
Character n-grams
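Character n-grams can be built the same way by treating a string as a sequence of characters; a minimal
sketch for character trigrams:
from nltk.util import ngrams

word = "play"
char_trigrams = ["".join(g) for g in ngrams(word, 3)]  # slide a window of 3 characters
print(char_trigrams)  # ['pla', 'lay']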
Now, we will use the NLTK corpus module to read the corpus austen-persuasion.txt, included in the
Gutenberg corpus collection, and answer the following questions:
Before we proceed with answering these questions, we will describe an NLTK built-in class which can
help us to get the answers in a simple way.
FreqDist
When dealing with a classification task, one may ask how we can automatically identify the words of a
text that are most informative about its topic and genre. One method would be to keep a tally for each
vocabulary item. This is known as a frequency distribution, and it tells us the frequency of each
vocabulary item in the text. It is a “distribution” because it tells us how the total number of word tokens
in the text is distributed across the vocabulary items. NLTK automates this through FreqDist.
Example:
from nltk import FreqDist
from nltk.tokenize import word_tokenize

data = "All work and no play makes jack dull boy. All work and no play makes jack a dull boy."
words = word_tokenize(data)
fdist1 = FreqDist(words)
print(fdist1.most_common(2))  # Prints the two most common tokens
print(fdist1.hapaxes())       # Prints tokens with frequency 1
For the following code snippet, fill in the comments with the answers where indicated. For the third
question you are asked to report the third most common token.
Submission: Create a file explore_corpus.py with the previous code snippet and submit to Moodle.
In the previous example we explored a corpus which, as you may have noticed, was imported from
nltk.corpus. NLTK offers a package of ready-to-use, labeled corpora for different purposes. In this section
we will do a simple classification task on movie reviews. The corpus is taken from nltk.corpus.movie_reviews.
The classifier will be NaiveBayesClassifier. Create a file movie_rev_classifier.py
with the following code. Run the code 3 times and report the accuracy for each run. Explain why the
accuracy differs between runs. Write the answers below the code snippet as a Python comment.
from nltk import FreqDist, NaiveBayesClassifier
from nltk.corpus import movie_reviews
from nltk.classify import accuracy
import random
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)
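A minimal sketch of one way the classifier part can be completed, following the standard NLTK-book
approach; the feature-set size, split point, and feature names here are illustrative assumptions and may
differ from the provided snippet:
# Sketch, assuming the standard NLTK-book approach (details may differ from the lab's snippet)
all_words = FreqDist(w.lower() for w in movie_reviews.words())
word_features = [w for w, _ in all_words.most_common(2000)]  # 2000 is an arbitrary choice

def document_features(document):
    document_words = set(document)
    # one boolean feature per frequent word: does the review contain it?
    return {'contains({})'.format(w): (w in document_words) for w in word_features}

featuresets = [(document_features(d), c) for (d, c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]  # 100-review test set (assumption)
classifier = NaiveBayesClassifier.train(train_set)
print(accuracy(classifier, test_set))  # varies across runs because documents were shuffled randomly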
In Python, an inverted index can be understood as a simple key/value dictionary where for each term
(key) we store a list of the appearances of that term in the documents along with its frequency. For
example, consider an inverted index that contains two terms (carpet and troop) in the lexicon. Each term
is connected to its posting list, which contains the document names along with the frequency of the term
in each document.
The lexicon can be implemented as a Python dictionary with key/value pairs, where each term can be
seen as a key and the term’s posting list as its value. The posting list itself can also be implemented as a
Python dictionary with filenames and frequencies as key/value pairs.
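For example, such a nested dictionary might look like this (the document names and counts below are
made up for illustration, not taken from the corpus):
# Hypothetical example of the structure
inverted_index = {
    'carpet': {'doc1.txt': 3, 'doc5.txt': 1},
    'troop':  {'doc1.txt': 2, 'doc5.txt': 4, 'doc7.txt': 5},
}
print(inverted_index['carpet'].keys())  # the documents containing 'carpet'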
We provide a python file called inverted_index.py which reads all documents within the Gutenberg
corpus and builds an inverted index (lexicon and posting lists). Then, it executes two queries:
• Query 1: “carpet AND troop”, which returns all documents that contain both terms. To find
all matching documents using the inverted index:
o Locate carpet in the dictionary
o Retrieve the documents in its postings list (get the keys of the posting list)
o Locate troop in the dictionary
o Retrieve the documents in its postings list (get the keys of the posting list)
o Intersect the two sets
o Return intersection
• Query 2: “carpet AND troop AND NOT overburden”, which returns all documents that
contain both terms carpet and troop but not overburden. To find all matching documents using
the inverted index:
o Locate carpet in the dictionary
o Retrieve the documents in its postings list (get the keys of the posting list)
o Locate troop in the dictionary
o Retrieve the documents in its postings list (get the keys of the posting list)
o Locate overburden in the dictionary
o Retrieve the documents in its postings list (get the keys of the posting list)
o Intersect the set of documents of carpet and troop
o Compute difference of the intersection with the overburden documents
o Return resulting list
You are required to fill in the source code where needed (replace the word None); a small sketch of the
set operations involved is shown below.
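For reference, a minimal sketch of how the two queries can be evaluated with set operations over such
an index (the index contents are the made-up example from above, not the real corpus):
# Hypothetical index, extending the earlier illustration
inverted_index = {
    'carpet':     {'doc1.txt': 3, 'doc5.txt': 1},
    'troop':      {'doc1.txt': 2, 'doc5.txt': 4, 'doc7.txt': 5},
    'overburden': {'doc5.txt': 1},
}

carpet_docs = set(inverted_index['carpet'].keys())
troop_docs = set(inverted_index['troop'].keys())
overburden_docs = set(inverted_index['overburden'].keys())

# Query 1: carpet AND troop -> intersection of the two posting lists' documents
print(carpet_docs & troop_docs)                      # {'doc1.txt', 'doc5.txt'}
# Query 2: carpet AND troop AND NOT overburden -> intersection minus overburden documents
print((carpet_docs & troop_docs) - overburden_docs)  # {'doc1.txt'}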
Submission
• list_merge.py
• stop_word_removal.py
• explore_corpus.py
• movie_rev_classifier.py
• inverted_index.py