
Natural Language Processing

Practical 1 BS(SE) 5th Semester

Dr. Nayyar Iqbal, Lecturer Department of Computer Science


Task
• Open Text File in Python
• Check the Contents of File
• Detect and Erase URLs
• Check if URLs are removed or not
• Save it as a New File
Open Text File in Python
• Create a “Nayyar.txt” file using Notepad
• Save the “Nayyar.txt” file on hard disk drive
• Open text file using open() function
• file = open("C:\\Users\\HP\\Desktop\\Practical NLP\\Nayyar.txt", "r")

• First parameter is the path of the file (drive, folder, sub-folder, filename and extension)
• Second parameter is the mode of opening the file
• "r" is read-only
Check the Contents of File
• Save the contents of the file in a variable by reading the file

• read() reads the contents of the file from start to end and stores them in a variable (e.g., text)

• Print the contents of the file

• print() displays the contents line by line
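A minimal sketch of these two steps, continuing from the open() call above:

text = file.read()   # read() returns the whole file as one string
print(text)          # the newlines in the string make the output appear line by line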


Detect and Erase URLs
• We need the regular expression library: re
• We need to import it using: import re

• Use the re.sub() function to replace HTTP/HTTPS URLs with a blank space and store the result in a separate variable

• First parameter is the regular expression for detecting URLs with http/https
• Second parameter is the string to replace URLs with (a blank space)
• Third parameter is the contents of the text file
• Fourth parameter is the flags argument
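A minimal sketch of this step, assuming the file contents are already stored in text; the exact regular expression is an assumption, since the slide shows it only as a screenshot:

import re

# Replace anything starting with http:// or https:// (up to the next whitespace) with a blank space
clean_text = re.sub(r"https?://\S+", " ", text, flags=re.MULTILINE)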
Check if URLs are removed or not
• Use print() to display the contents after removing URLs
Save it as a New File
• Open a new file (choose a name for the file you want to save)

• The open() method opens the new file in "w+" mode (write and read)

• Write the contents to the new file

• Close the newly created file to save it on disk
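A minimal sketch, assuming the cleaned text is stored in clean_text; the output filename is an assumption:

new_file = open("C:\\Users\\HP\\Desktop\\Practical NLP\\Nayyar_NoURLs.txt", "w+")  # hypothetical output name
new_file.write(clean_text)
new_file.close()   # closing flushes the contents to disk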


Complete Code
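The original slide shows the full program as a screenshot; a sketch that follows the steps above (file names and the regular expression are assumptions):

import re

# Step 1: open and read the text file
file = open("C:\\Users\\HP\\Desktop\\Practical NLP\\Nayyar.txt", "r")
text = file.read()
print(text)

# Step 2: detect and erase http/https URLs
clean_text = re.sub(r"https?://\S+", " ", text, flags=re.MULTILINE)
print(clean_text)

# Step 3: save the result as a new file
new_file = open("C:\\Users\\HP\\Desktop\\Practical NLP\\Nayyar_NoURLs.txt", "w+")
new_file.write(clean_text)
new_file.close()
file.close()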
Natural Language Processing
Practical 2 BS(SE) 5th Semester

Dr. Nayyar Iqbal, Lecturer Department of Computer Science


Stem or Lemmatize Words
Python Practical 6
Stemming
• Process of reducing a word to its word stem
• Important in natural language understanding (NLU) and natural
language processing (NLP)
• Stemming is also a part of queries and Internet search engines
• Example:
• Words = Used, User, Using, Usable
• Stem = USE
Stemming and Lemmatization
• Stemming and Lemmatization are text (word) normalization techniques
• They are used in Natural Language Processing to prepare text, words, and documents for further processing
How to do it in Python?
• Using NLTK [Natural Language Tool Kit]
• Steps:
• Download and Install NLTK
• Stemming Words using PorterStemmer and LancasterStemmer
• Stemming Sentences (with TOKENIZATION)
• Stemming Documents
Step 1-Download and Install NLTK
• NLTK stands for Natural Language Toolkit
• This is a suite of libraries and programs for statistical natural language processing
for English written in Python
• Open command prompt as Administrator
• Use command pip install NLTK in command prompt
Step 1-Download and Install NLTK

• Use: nltk.download() in Python Shell


• Use the downloader to download packages, corpora, models
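A minimal sketch of this step:

import nltk

nltk.download()           # opens the NLTK downloader for packages, corpora and models
nltk.download('punkt')    # or download a specific resource, e.g. the Punkt tokenizer models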
Step 2-Stemming Words using
PorterStemmer and LancasterStemmer
• PorterStemmer uses suffix stripping to produce stems. Notice how the PorterStemmer gives the root (stem) of the word "cats" by simply removing the 's' after "cat"
• Simple and speedy

• The LancasterStemmer (Paice-Husk stemmer) is an iterative algorithm with rules saved externally
• One table contains about 120 rules indexed by the last letter of a suffix
• On each iteration, it tries to find an applicable rule by the last character of the word
• Each rule specifies either a deletion or replacement of an ending
Step 2-Stemming Words using
PorterStemmer and LancasterStemmer
Import Stemmers

Create Objects of Stemmers

Stemming using PorterStemmer

Stemming using LancasterStemmer
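The code for these captions appears as screenshots in the original; a minimal sketch of the same steps (the example words are assumptions):

# Import the stemmers
from nltk.stem import PorterStemmer, LancasterStemmer

# Create objects of the stemmers
porter = PorterStemmer()
lancaster = LancasterStemmer()

# Stem individual words with each stemmer; the two can produce different stems for the same word
for word in ["cats", "used", "using", "usable"]:
    print(word, porter.stem(word), lancaster.stem(word))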


Step 2-Stemming Sentences (without
TOKENIZATION)
Import Stemmers

Create Objects of Stemmers

Stemming using

1. PorterStemmer
2. LancasterStemmer

As you can see, the stemmer treats the entire sentence as a single word, so it returns it as it is. We need to stem each word in the sentence and return a combined sentence. To separate the sentence into words, you can use a tokenizer.
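A minimal sketch of what the slide demonstrates; the sentence is an assumption:

from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

sentence = "Pythoners are very intelligent and work very pythonly"  # example sentence, an assumption

# The stemmer receives the whole sentence as a single token,
# so the words inside it are not stemmed individually
print(porter.stem(sentence))
print(lancaster.stem(sentence))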
Step 2-Stemming Sentences (with
TOKENIZATION)
Import Stemmers

Create Objects of Stemmers

Import TOKENIZER

Set the Sentence

Tokenize the Sentence & Display TOKENS

Stem TOKENS and store in a separate list

Join the list containing STEM words
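A minimal sketch of these steps, with an example sentence as an assumption:

# Import the stemmer and the tokenizer
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

porter = PorterStemmer()

# Set the sentence
sentence = "Pythoners are very intelligent and work very pythonly"  # assumption

# Tokenize the sentence and display the tokens
tokens = word_tokenize(sentence)
print(tokens)

# Stem the tokens and store them in a separate list
stemmed = [porter.stem(token) for token in tokens]

# Join the list containing the stemmed words
print(" ".join(stemmed))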
Natural Language Processing
Practical 2 BS(SE) 5th Semester

Dr. Nayyar Iqbal, Lecturer Department of Computer Science


Split Text into Words via Tokenization

Python Practical
Objectives
• What is tokenization?
• Different ways to perform tokenization
What is Tokenization?
• Tokenization is a common task a data scientist comes
across when working with text data
• It consists of splitting an entire text into small units, also
known as tokens
• NLP projects have tokenization as the first step because it
is the foundation for developing good models and helps
better understand the text we have
Different Ways to Tokenize Text
1. Using split() function
2. NLTK
Tokenization with split() - Steps
• Open the text document

• Print the text document

• Tokenize the text document

• Display tokens on screen


Complete Code
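A minimal sketch of the split() method; the file path is an assumption:

# Open and read the text document
file = open("C:\\Users\\HP\\Desktop\\Practical NLP\\Nayyar.txt", "r")
text = file.read()

# Print the text document
print(text)

# Tokenize the text by splitting on whitespace, then display the tokens
tokens = text.split()
print(tokens)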
2. NLTK – Steps
• NLTK stands for Natural Language Toolkit
• This is a suite of libraries and programs for statistical natural language processing
for English written in Python
• Recipe:
• Install NLTK
• Import NLTK
• Read text document file
• Display the text
• Tokenize text
• Display tokens on screen
Install NLTK
• Open command prompt as Administrator
• Use command pip install NLTK in command prompt
Import NLTK in Python
• NLTK needs to be imported in Python program
• NLTK contains a module called tokenize – we’ll be importing this module
• import nltk
• from nltk.tokenize import word_tokenize
Open and read text document file
• Same method used previously
Display text document
• Same method used previously
Tokenize text with NLTK & Display Tokens
• Use word_tokenize() method

• print() method to display tokens


Complete Code
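A minimal sketch of the NLTK recipe; the file path is an assumption:

import nltk
from nltk.tokenize import word_tokenize

# Read the text document file and display the text
file = open("C:\\Users\\HP\\Desktop\\Practical NLP\\Nayyar.txt", "r")
text = file.read()
print(text)

# Tokenize the text and display the tokens
tokens = word_tokenize(text)
print(tokens)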
Error! (Use nltk.download)

• If word_tokenize() raises a LookupError because the Punkt tokenizer data is missing, use: nltk.download() (or nltk.download('punkt')) to fetch it
Natural Language Processing
Practical 3 BS(SE) 5th Semester

Dr. Nayyar Iqbal, Lecturer Department of Computer Science


Remove Punctuation Marks from Text

Python Practical
Punctuations
• Symbols
• Used to punctuate text
• . , ; " ' ! - [different punctuation marks]
How to remove punctuation marks?
• Three methods:
• Using string library
• Using for each loop
• Using replace() function
Method # 1 – Using string library
1. Import string library
2. Open a text file and read it
3. Print text
4. Remove punctuations
5. Print text without punctuations
Step # 1 – Import string library
• Python's string module is a built-in module that contains constants, utility functions, and classes for string manipulation
Step # 2 – Open text file and read it
• Use already discussed method
Step # 3 – Print text
• Use already discussed method
Step # 4 – Remove punctuations
• Use translate() and maketrans() methods

https://datagy.io/python-remove-punctuation-from-string/
Step # 5 – Print text without punctuations
Complete Code
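A minimal sketch of the string library method; the file path is an assumption:

import string

# Open and read the text file, then print it
file = open("C:\\Users\\HP\\Desktop\\Practical NLP\\Nayyar.txt", "r")
text = file.read()
print(text)

# Remove punctuation: maketrans() builds a mapping that deletes every
# character in string.punctuation, and translate() applies it
clean_text = text.translate(str.maketrans("", "", string.punctuation))

# Print the text without punctuation
print(clean_text)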
Detect Language of Text
Python Practical 4
Using "TextBlob"
• Steps:
• Install TextBlob library
• Import TextBlob
• Detect the language
Step 1: Install TextBlob Library
• Open Command Prompt
• Install TextBlob library using command:
• pip install textblob
Step 2: Import textblob Library
• Open Python IDLE Shell
• Write the code to import TextBlob
• from textblob import TextBlob
Step 3: Detect the language
• Create text variable and store some sentence:

• Convert sentence into Blob and detect language


Complete Code
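A minimal sketch following the steps above. Note that TextBlob's detect_language() relies on an online translation service and has been deprecated in recent TextBlob releases, so it may not work on newer versions; the example sentence is an assumption:

from textblob import TextBlob

# Create a text variable and store some sentence
text = "Comment allez-vous aujourd'hui ?"

# Convert the sentence into a Blob and detect the language
blob = TextBlob(text)
print(blob.detect_language())   # expected to print a language code such as 'fr'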
Remove Stop Words from
Text
Python Practical 5
Stop Words
• Common words in any language like:
• Articles e.g., the and a/an
• Prepositions
A preposition is a word or group of words used before a noun, pronoun, or noun phrase to show direction, time, place, location, spatial relationships, or to introduce an object. Some examples of prepositions are words like "in," "at," "on," "of," and "to."
• Pronouns
A pronoun (I, me, he, she, herself, you, it, that, they, each, few, many, who,
whoever, whose, someone, everybody, etc.) is a word that takes the place
of a noun. In the sentence Joe saw Jill, and he waved at her, the pronouns
he and her take the place of Joe and Jill, respectively. There are three
types of pronouns: subject (for example, he ); object (him); or possessive
(his).
• Conjunctions
A conjunction is a word that is used to connect words,
phrases, and clauses. There are many conjunctions in the
English language, but some common ones include and, or, but,
because, for, if, and when.
Methods to remove stop words
• Different methods:
• Using NLTK
• Using spaCy
• Using Gensim
• Using SKLearn
Method # 1: Using NLTK
• NLTK = Natural Language Toolkit
• Steps:
• Install nltk
• Import nltk
• Open and read text file
• Display the text in the text file (optional)
• Get a list of stop words
• Remove stop words
• Print modified output
Step 1: Install NLTK
• Open command prompt as Administrator
• Use command pip install NLTK in command prompt
Step 2: Import NLTK in Python
• NLTK needs to be imported in Python program
• NLTK contains a module called tokenize – we’ll be importing this module
• import nltk
• from nltk.tokenize import word_tokenize
Step 3: Open and read text document
file
• Same method used previously
Step 4: Display the text in the text
file
• Same method used previously
Step 5: Get list of stop words
Step 6: Remove stop words
Step 7: Print modified text
Complete code
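A minimal sketch of the NLTK method; the file path is an assumption, and the stop-word list comes from nltk.corpus (run nltk.download('stopwords') once if that resource is missing):

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Open and read the text file, then display it
file = open("C:\\Users\\HP\\Desktop\\Practical NLP\\Nayyar.txt", "r")
text = file.read()
print(text)

# Get the list of English stop words
stop_words = set(stopwords.words('english'))

# Remove stop words: keep only the tokens that are not in the stop-word list
tokens = word_tokenize(text)
filtered = [word for word in tokens if word.lower() not in stop_words]

# Print the modified output
print(" ".join(filtered))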
