LAB02
What is NLTK?
Natural Language Toolkit (NLTK) is a leading platform for building Python programs to work with human
language data (Natural Language Processing). It is accompanied by a book that explains the underlying
concepts behind the language processing tasks supported by the toolkit. NLTK is intended to support
research and teaching in NLP or closely related areas, including empirical linguistics, cognitive science,
artificial intelligence, information retrieval, and machine learning.
http://www.nltk.org/install.html
http://www.nltk.org/data.html
http://www.tutorialspoint.com/python/python tutorial.pdf
Python overview
Basic syntax
Identifiers
A Python identifier is a name used to identify a variable, function, class, module, or other object. An
identifier starts with a letter (A to Z or a to z) or an underscore (_), followed by zero or more letters,
underscores, and digits (0 to 9). Python does not allow punctuation characters such as @, $, and % within
identifiers. Python is a case-sensitive programming language; thus, Variable and variable are two
different identifiers in Python.
Python provides no braces to indicate blocks of code for class and function definitions or flow control.
Blocks of code are denoted by line indentation, which is rigidly enforced. The number of spaces in the
indentation is variable, but all statements within the block must be indented the same amount.
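For example, a minimal illustration where indentation marks the block:
x = 5
if x > 0:
    print("positive")     # these two indented statements
    print("done")         # form one block
else:
    print("non-positive")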
Quotation
Python accepts single ('), double (") and triple (''' or """) quotes to denote string literals, as long as
the same type of quote starts and ends the string.
Examples:
word = 'word'
sentence = "This is a sentence."
paragraph = """This is a paragraph. It is made up of multiple lines and
sentences."""
Standard data types
Python has five standard data types:
• numbers;
• strings;
• lists;
• tuples;
• dictionaries.
Python variables do not need explicit declaration to reserve memory space. The declaration happens
automatically when you assign a value to a variable. The equal sign (=) is used to assign values to
variables. The operand to the left of the = operator is the name of the variable and the operand to the
right of the = operator is the value stored in the variable.
For example:
counter = 100 # An integer assignment
miles = 1000.0 # A floating point
name = "John" # A string
Lists
print(len([1, 2, 3]))          # 3 - length
print([1, 2, 3] + [4, 5, 6])   # [1, 2, 3, 4, 5, 6] - concatenation
print(['Hi!'] * 4)             # ['Hi!', 'Hi!', 'Hi!', 'Hi!'] - repetition
print(3 in [1, 2, 3])          # True - membership
for x in [1, 2, 3]:
    print(x)                   # 1 2 3 - iteration
Some built-in functions useful when working with lists are max, min, len, and list (which converts a
tuple to a list); cmp is available only in Python 2. Some of the list-specific methods are list.append,
list.extend, list.count, etc.
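A short illustrative sketch of these on a sample list:
numbers = [3, 1, 4, 1, 5]
print(len(numbers))      # 5
print(max(numbers))      # 5
print(min(numbers))      # 1
numbers.append(9)        # [3, 1, 4, 1, 5, 9]
numbers.extend([2, 6])   # [3, 1, 4, 1, 5, 9, 2, 6]
print(numbers.count(1))  # 2
print(list((7, 8)))      # [7, 8] - tuple converted to list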
Tuples
Basic tuple operations are the same as with lists: length, concatenation, repetition, membership, and
iteration. The main difference is that tuples are immutable.
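A short illustration (note that, unlike lists, tuples cannot be modified after creation):
a_tuple = (1, 2, 3)
print(len(a_tuple))       # 3 - length
print(a_tuple + (4, 5))   # (1, 2, 3, 4, 5) - concatenation
print(2 in a_tuple)       # True - membership
for x in a_tuple:
    print(x)              # 1 2 3 - iteration
# a_tuple[0] = 9          # would raise TypeError: tuples are immutable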
Dictionaries
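A dictionary stores key/value pairs and is indexed by keys rather than by positions. A minimal illustration:
capitals = {'Cyprus': 'Nicosia', 'Greece': 'Athens'}
print(capitals['Cyprus'])        # Nicosia - lookup by key
capitals['Italy'] = 'Rome'       # add a new key/value pair
print(len(capitals))             # 3
print('Greece' in capitals)      # True - membership tests keys
for country, city in capitals.items():
    print(country, city)         # iterate over key/value pairs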
List comprehension
Comprehensions are constructs that allow sequences to be built from other sequences. Python 2.0
introduced list comprehensions, and Python 3.0 added dictionary and set comprehensions. The
following is an example:
a_list = [1, 2, 9, 3, 0, 4]
squared_ints = [e**2 for e in a_list]
print(squared_ints) # [ 1, 4, 81, 9, 0, 16 ]
a_list = [1, 2, 9, 3, 0, 4]
squared_ints = []
for e in a_list:
    squared_ints.append(e**2)
print(squared_ints) # [ 1, 4, 81, 9, 0, 16 ]
Now, let’s see an example with an if statement. The example below shows how to filter out non-integer
types from a mixed list and apply operations.
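A minimal sketch, using a made-up mixed list:
a_list = [1, '4', 9, 'a', 0, 4]
squared_ints = [e**2 for e in a_list if type(e) == int]  # keep and square only the integers
print(squared_ints)  # [1, 81, 0, 16]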
However, if you want to include an if-else statement, the arrangement looks a bit different, as shown below.
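A minimal sketch of the if-else form: the condition moves to the front of the comprehension, and here
non-integers are kept unchanged instead of being dropped.
a_list = [1, '4', 9, 'a', 0, 4]
result = [e**2 if type(e) == int else e for e in a_list]
print(result)  # [1, '4', 81, 'a', 0, 16]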
String handling
Other useful string methods include join, split, count, capitalize, strip, upper, lower, etc.
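A short illustration of some of these methods:
s = "  Hello world  "
print(s.strip())              # Hello world - whitespace removed from both ends
print(s.upper())              #   HELLO WORLD
print(s.count('l'))           # 3
print(s.split())              # ['Hello', 'world']
print('-'.join(s.split()))    # Hello-world
print("python".capitalize())  # Python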
IO handling
Python 2 had two built-in functions for reading from standard input: raw_input and input. In Python 3
only input() remains, and it always returns a string.
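A minimal Python 3 sketch of reading from standard input:
name = input("Enter your name: ")     # input() returns a string
print("Hello,", name)
age = int(input("Enter your age: "))  # convert explicitly when a number is needed
print(age + 1)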
File opening
A useful module for handling TSV and CSV files is the csv library.
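A minimal sketch of reading a tab-separated file with the csv module (the filename data.tsv is a made-up example):
import csv

with open('data.tsv', newline='') as f:      # 'data.tsv' is a hypothetical file
    reader = csv.reader(f, delimiter='\t')   # use delimiter=',' for comma-separated files
    for row in reader:
        print(row)                           # each row is a list of column values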
Functions
def functionname(parameters):
    "function_docstring"
    function_suite
    return [expression]
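For example, a small function following this pattern:
def greet(name, greeting="Hello"):
    """Return a greeting for the given name."""
    return greeting + ", " + name + "!"

print(greet("John"))        # Hello, John!
print(greet("Mary", "Hi"))  # Hi, Mary!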
The given VM has an Anaconda environment installed along with Python 3.8 and provides the Spyder IDE
for writing and running Python source code.
Create a file called list_merge.py. Type the following code and fill in the missing parts (< ... >). Create a
dictionary result, where the keys are the values from a given list and the values are taken from a given
tuple. Use a list comprehension or a standard loop.
Submission: Add the code where indicated and submit list_merge.py to Moodle.
Word tokenization: A sentence or text can be split into words using the method word_tokenize():
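A minimal sketch that produces the token list shown below (the example sentence is reconstructed from
that output; NLTK and its punkt tokenizer data are assumed to be installed):
from nltk.tokenize import word_tokenize

data = "All work and no play makes jack dull boy, all work and no play"
print(word_tokenize(data))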
['All', 'work', 'and', 'no', 'play', 'makes', 'jack', 'dull', 'boy', ',', 'all', 'work', 'and', 'no', 'play']
All of them are words except the comma; punctuation marks are treated as separate tokens.
Sentence tokenization: The same principle can be applied to sentences; simply change word_tokenize()
to sent_tokenize(). We have added two sentences to the variable data:
data = "All work and no play makes jack dull boy. All work and no play makes jack a dull boy."
print(sent_tokenize(data))
Outputs:
['All work and no play makes jack dull boy.', 'All work and no play makes jack a dull boy.']
If you wish, you can store the words and sentences in lists:
from nltk.tokenize import sent_tokenize, word_tokenize
data = "All work and no play makes jack dull boy. All work and no play makes jack a dull boy."
phrases = sent_tokenize(data)
words = word_tokenize(data)
print(phrases)
print(words)
English text may contain stop words like ‘the’, ‘is’, ‘are’. Stop words can be filtered out of the text to be
processed. There is no universal list of stop words in NLP research; however, the NLTK module contains
such a list. Now you will learn how to remove stop words using NLTK. We start with the code from the
previous section with tokenized words.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

data = "All work and no play makes jack dull boy. All work and no play makes jack a dull boy."
stopWords = set(stopwords.words('english'))  # a set of English stopwords
words = word_tokenize(data.lower())
wordsFiltered = []
for w in words:
    if w not in stopWords:
        wordsFiltered.append(w)
<your code> # Print the number of stopwords
<your code> # Print the stopwords
<your code> # Print the filtered text: ['work', 'play', 'makes', 'jack', 'dull', 'boy', '.', 'work', 'play', 'makes', 'jack', 'dull', 'boy', '.']
Submission: Create a file named stop_word_removal.py with the previous code snippet and submit
it to Moodle.
Step 4. Stemming
A word stem is the base part of a word. Stemming is a kind of normalization, but a linguistic one: for
example, the stem of the word waiting is wait. Given words, NLTK can find the stems. Start by defining some words:
words = ["game","gaming","gamed","games"]
We import the module:
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
There are more stemming algorithms, but Porter (PorterStemmer) is the most popular.
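A minimal sketch that stems the words defined above with PorterStemmer:
ps = PorterStemmer()
for word in words:
    print(ps.stem(word))  # game game game game - all four words share the stem 'game'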
Step 5. n-grams
Word n-grams
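A minimal sketch using nltk.util.ngrams to build word bigrams from a tokenized sentence (the sentence
reuses the earlier example):
from nltk.util import ngrams
from nltk.tokenize import word_tokenize

data = "All work and no play makes jack a dull boy."
tokens = word_tokenize(data)
bigrams = list(ngrams(tokens, 2))  # n=2 gives word bigrams
print(bigrams[:3])                 # [('All', 'work'), ('work', 'and'), ('and', 'no')]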
Character n-grams
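Character n-grams can be built the same way by treating a string as a sequence of characters; a minimal
sketch for character trigrams:
from nltk.util import ngrams

word = "play"
char_trigrams = ["".join(g) for g in ngrams(word, 3)]  # slide a window of 3 characters
print(char_trigrams)  # ['pla', 'lay']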
Now, we will use the NLTK corpus module to read the corpus austen-persuasion.txt, included in the
Gutenberg corpus collection, and answer the following questions:
Before we proceed with answering these questions, we will describe an NLTK built-in class which can
help us to get the answers in a simple way.
FreqDist
When dealing with a classification task, one may ask how we can automatically identify the words of a
text that are most informative about its topic and genre. One method would be to keep a tally for each
vocabulary item. This is known as a frequency distribution, and it tells us the frequency of each
vocabulary item in the text. It is a “distribution” because it tells us how the total number of word tokens
in the text is distributed across the vocabulary items. NLTK automates this through FreqDist.
Example:
from nltk import FreqDist
from nltk.tokenize import word_tokenize

data = "All work and no play makes jack dull boy. All work and no play makes jack a dull boy."
words = word_tokenize(data)
fdist1 = FreqDist(words)
print(fdist1.most_common(2))  # Prints the two most common tokens
print(fdist1.hapaxes())       # Prints tokens with frequency 1
For the following code snippet, fill in the comments with the answers where indicated. For the third
question you are asked to report the third most common token.
Submission: Create a file explore_corpus.py with the previous code snippet and submit to Moodle.
In the previous example we explored a corpus which, as you may have noticed, was imported from
nltk.corpus. NLTK offers a package of ready-to-use, labeled corpora for different purposes. In this section
we will do a simple classification task on movie reviews. The corpus is taken from nltk.corpus.movie_reviews.
The classifier will be NaiveBayesClassifier. Create a file movie_rev_classifier.py
with the following code. Run the code 3 times and report the accuracy for each run. Explain why the
accuracy differs between runs. Write the answers below the code snippet as a Python comment.
from nltk import FreqDist, NaiveBayesClassifier
from nltk.corpus import movie_reviews
from nltk.classify import accuracy
import random
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)
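A minimal sketch of one way the classifier part can be completed, following the standard NLTK-book
approach; the feature-set size, split point, and feature names here are illustrative assumptions and may
differ from the provided snippet:
# Sketch, assuming the standard NLTK-book approach (details may differ from the lab's snippet)
all_words = FreqDist(w.lower() for w in movie_reviews.words())
word_features = [w for w, _ in all_words.most_common(2000)]  # 2000 is an arbitrary choice

def document_features(document):
    document_words = set(document)
    # one boolean feature per frequent word: does the review contain it?
    return {'contains({})'.format(w): (w in document_words) for w in word_features}

featuresets = [(document_features(d), c) for (d, c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]  # 100-review test set (assumption)
classifier = NaiveBayesClassifier.train(train_set)
print(accuracy(classifier, test_set))  # varies across runs because documents were shuffled randomly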
In Python, an inverted index can be understood as a simple key/value dictionary where for each term
(key) we store a list of the appearances of that term in the documents along with its frequency. For
example, consider an inverted index that contains two terms (carpet and troop) in the lexicon. Each term
is connected to its posting list, which contains the document names along with the frequency of the term
in each document.
The lexicon can be implemented as a Python dictionary with key/value pairs, where each term can be
seen as a key and the term’s posting list as its value. The posting list itself can also be implemented as a
Python dictionary with filenames and frequencies as key/value pairs.
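For example, such a nested dictionary might look like this (the document names and counts below are
made up for illustration, not taken from the corpus):
# Hypothetical example of the structure
inverted_index = {
    'carpet': {'doc1.txt': 3, 'doc5.txt': 1},
    'troop':  {'doc1.txt': 2, 'doc5.txt': 4, 'doc7.txt': 5},
}
print(inverted_index['carpet'].keys())  # the documents containing 'carpet'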
We provide a python file called inverted_index.py which reads all documents within the Gutenberg
corpus and builds an inverted index (lexicon and posting lists). Then, it executes two queries:
• Query 1: “carpet AND troop”, which returns all documents that contain both terms. To find
all matching documents using the inverted index:
o Locate carpet in the dictionary
o Retrieve the documents in its postings list (get the keys of the posting list)
o Locate troop in the dictionary
o Retrieve the documents in its postings list (get the keys of the posting list)
o Intersect the two sets
o Return intersection
• Query 2: “carpet AND troop AND NOT overburden”, which returns all documents that
contain both terms carpet and troop but not overburden. To find all matching documents using
the inverted index:
o Locate carpet in the dictionary
o Retrieve the documents in its postings list (get the keys of the posting list)
o Locate troop in the dictionary
o Retrieve the documents in its postings list (get the keys of the posting list)
o Locate overburden in the dictionary
o Retrieve the documents in its postings list (get the keys of the posting list)
o Intersect the set of documents of carpet and troop
o Compute difference of the intersection with the overburden documents
o Return resulting list
You are required to fill in the source code where needed (replace the word None); a small sketch of the
set operations involved is shown below.
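For reference, a minimal sketch of how the two queries can be evaluated with set operations over such
an index (the index contents are the made-up example from above, not the real corpus):
# Hypothetical index, extending the earlier illustration
inverted_index = {
    'carpet':     {'doc1.txt': 3, 'doc5.txt': 1},
    'troop':      {'doc1.txt': 2, 'doc5.txt': 4, 'doc7.txt': 5},
    'overburden': {'doc5.txt': 1},
}

carpet_docs = set(inverted_index['carpet'].keys())
troop_docs = set(inverted_index['troop'].keys())
overburden_docs = set(inverted_index['overburden'].keys())

# Query 1: carpet AND troop -> intersection of the two posting lists' documents
print(carpet_docs & troop_docs)                      # {'doc1.txt', 'doc5.txt'}
# Query 2: carpet AND troop AND NOT overburden -> intersection minus overburden documents
print((carpet_docs & troop_docs) - overburden_docs)  # {'doc1.txt'}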
Submission
• list_merge.py
• stop_word_removal.py
• explore_corpus.py
• movie_rev_classifier.py
• inverted_index.py