NLP-Lab Manual - Ashwini - Kachare
Semester: VI
Experiment No.1
Aim: Write a program to implement text processing (word and sentence tokenization, lowercase and uppercase conversion) using Python.
Theory:
Tokenization is the process by which a large quantity of text is divided into smaller parts called
tokens. These tokens are very useful for finding patterns and are considered a base step for
stemming and lemmatization. Tokenization also helps to substitute sensitive data elements with
non-sensitive data elements.
Natural language processing is used for building applications such as text classification,
intelligent chatbots, sentiment analysis, language translation, etc. It is therefore vital to
understand the patterns in the text to achieve the above-stated purposes.
The Natural Language Toolkit (NLTK) has a very important module, nltk.tokenize, which
comprises the following sub-modules:
1. Word tokenize
2. Sentence tokenize
Tokenization of words
We use the method word_tokenize() to split a sentence into words. The output of word
tokenization can be converted to a DataFrame for better text understanding in machine learning
applications. It can also be provided as input for further text-cleaning steps such as punctuation
removal, numeric character removal or stemming. Machine learning models need numeric data
to be trained and to make predictions, so word tokenization becomes a crucial part of the text
(string) to numeric data conversion.
Tokenization of Sentences
The sub-module available for this is sent_tokenize. An obvious question in your mind would
be why sentence tokenization is needed when we have the option of word tokenization. Imagine
you need to count the average number of words per sentence; how will you calculate it? To
accomplish such a task, you need both the NLTK sentence tokenizer and the NLTK word
tokenizer to calculate the ratio. Such output serves as an important feature for machine training,
as the answer would be numeric.
Code:
#Tokenization
text = "Hello Everyone"
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize,word_tokenize
Output:
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Unzipping tokenizers/punkt.zip.
Word Tokenization
word_tokenize(text)
Output
['Hello', 'Everyone']
Sentence Tokenization
Code:
text = "God is Great! I Won a Lottery."
sent_tokenize(text)
Output:
['God is Great!', 'I Won a Lottery.']
Screenshot:
Lower Case
Code:
import string
raw_docs = ["I am wariting some very basic english sentences", "I`m just writing it for the
demo PURPOSE to make audience understand the basics .""The Point is to learn HOW it
works_on #simple # data. "]
raw_docs = [doc.lower() for doc in
raw_docs] print(raw_docs)
Output:
['i am wariting some very basic english sentences', 'i`m just writing it for the demo purpose to
make audience understand the basics .the point is to learn how it works_on #simple # data. ']
Upper Case
Code:
Output:
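The code and output cells for the uppercase step are blank in the manual; a minimal sketch, assuming the same raw_docs list from the lowercase step above:
# Uppercase conversion, mirroring the lowercase step (raw_docs assumed from above)
upper_docs = [doc.upper() for doc in raw_docs]
print(upper_docs)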
Outcome: After finishing this practical, the student will be able to understand the basics of text
processing.
Experiment No.2
Aim: Write a program to implement text processing: stop word removal, punctuation removal and
filtration.
Theory:
English text may contain stop words like ‘the’, ‘is’, ‘are’. Stop words can be filtered from the
text to be processed. There is no universal list of stop words in NLP research; however, the
NLTK module contains a list of stop words. In this experiment you will learn how to remove
stop words using NLTK.
Filtration:
Many of the words used in a phrase are insignificant and hold no meaning. For example:
"English is a subject." Here, ‘English’ and ‘subject’ are the most significant words, while ‘is’
and ‘a’ are almost useless. "English subject" and "subject English" hold the same meaning even
if we remove the insignificant words (‘is’, ‘a’).
Stop-Word Removal
Code:
import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords
Output:
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Unzipping corpora/stopwords.zip.
True
Code:
print(stopwords.words("english"))
Output:
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]
Screenshot
Filtering
Code:
import nltk
text = "This is an example text for stopword removal and filtering. This is done using
NLTK's stopwords."
words = nltk.word_tokenize(text)
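The filtering step itself is not shown in the manual; a minimal sketch (assuming the stopwords corpus downloaded above) that removes stop words and punctuation from the token list:
import string
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
# Keep only tokens that are neither stop words nor punctuation marks
filtered_words = [w for w in words if w.lower() not in stop_words and w not in string.punctuation]
print(filtered_words)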
Screenshot
Outcome: - After the practical, the student has understood and implemented text processing:
stop word removal and filtration.
Experiment No.3
Stemming and Lemmatization Algorithms
Objective: To make the student learn how to stem and lemmatize words in NLP
Theory:
What is Stemming?
Stemming is the process of reducing a word to its root form. In other words, there is one root
word, but there are many variations of the same word. For example, the root word is "eat" and
its variations are "eats, eating, eaten" and so on. In the same way, with the help of stemming,
we can find the root word of any variation.
Example
He was riding.
He was taking the ride.
In the above two sentences, the meaning is the same, i.e., a riding activity in the past. A human
can easily understand that both meanings are the same, but for machines the two sentences are
different, so it becomes hard to convert them into the same data row. If we do not provide the
same dataset, then the machine fails to predict. So it is necessary to normalize each word to
prepare the dataset for machine learning, and here stemming is used to categorize the same type
of data by getting its root word.
What is Lemmatization?
Lemmatization is the process of reducing a word to its base or dictionary form, called the
lemma, by taking the word's part of speech into account. Unlike stemming, which simply chops
off word endings, lemmatization always returns a valid word.
Stemming Code:
from nltk.stem import LancasterStemmer
lancaster_stemmer = LancasterStemmer()
print(lancaster_stemmer.stem('observing'))
print(lancaster_stemmer.stem('observs'))
print(lancaster_stemmer.stem('observe'))
Output
observ
observ
observ
Screenshot:
Lemmatization Code
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
Output
[nltk_data] Downloading package wordnet to /root/nltk_data...
True
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("running"))
print(lemmatizer.lemmatize("runs"))
output
running
run
Screenshot:
Lemmatizer - returns the verb, noun, adverb and adjective forms
def lemmtize(word):
    lemmatizer = WordNetLemmatizer()
    print("verb form: " + lemmatizer.lemmatize(word, pos="v"))
    print("noun form: " + lemmatizer.lemmatize(word, pos="n"))
    print("adverb form: " + lemmatizer.lemmatize(word, pos="r"))
    print("adjective form: " + lemmatizer.lemmatize(word, pos="a"))
lemmtize("ears")
Output
verb form: ears
noun form: ear
adverb form: ears
adjective form: ears
Screenshot:
The following code snippet shows the comparison between stemming and lemmatization.
Code:
import nltk
nltk.download('wordnet')
Output
[nltk_data] Downloading package wordnet to /root/nltk_data...
True
print(lemmatizer.lemmatize("deactivating", pos="v"))
print(lemmatizer.lemmatize("deactivative", pos="r"))
print(lemmatizer.lemmatize("deactivating", pos="n"))
Output
deactivate
deactivative
deactivating
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()  # the outputs below match the Porter stemmer
print(stemmer.stem('stones'))
print(stemmer.stem('speaking'))
print(stemmer.stem('bedroom'))
print(stemmer.stem('jokes'))
print(stemmer.stem('lisa'))
print(stemmer.stem('purple'))
Output
stone
speak
bedroom
joke
lisa
purpl
print(lemmatizer.lemmatize('stones'))
print(lemmatizer.lemmatize('speaking'))
print(lemmatizer.lemmatize('bedroom'))
print(lemmatizer.lemmatize('jokes'))
print(lemmatizer.lemmatize('lisa'))
print(lemmatizer.lemmatize('purple'))
Output
stone
speaking
bedroom
joke
lisa
purple
Screenshot:
Outcome: - After the practical, the student has understood and implemented stemming and
lemmatization.
Experiment No.4
Aim: Write a program to implement different POS taggers and perform POS tagging on the
text.
Theory:
POS Tagging (Parts of Speech Tagging) is a process to mark up the words in a text for a
particular part of speech based on their definition and context. It is responsible for reading text
in a language and assigning a specific token (part of speech) to each word. It is also called
grammatical tagging.
What is Chunking in NLP
Chunking in NLP is a process of taking small pieces of information and grouping them into
larger units. The primary use of chunking is to make groups of "noun phrases." It is used to add
structure to the sentence by applying regular expressions on top of POS tagging. The resulting
groups of words are called "chunks." Chunking is also called shallow parsing.
In shallow parsing, there is at most one level between roots and leaves, while deep parsing
comprises more than one level. Shallow parsing is also called light parsing or chunking.
Rules for Chunking: There are no pre-defined rules, but you can combine them according to
your needs and requirements.
For example, suppose you need to tag nouns, verbs (past tense), adjectives, and coordinating
conjunctions from a sentence. You can use a rule as below (a sketch applying it follows the rule):
chunk:{<NN.?>*<VBD.?>*<JJ.?>*<CC>?}
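A minimal sketch (not part of the original manual) showing how this rule could be applied with NLTK's RegexpParser; the sample sentence is an assumption chosen to contain a noun, a past-tense verb, an adjective and a conjunction:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Assumed sample sentence for illustration
sentence = "The dog barked and chased the small cat"
tags = nltk.pos_tag(nltk.word_tokenize(sentence))

# The rule from the theory above, wrapped in a grammar string
grammar = "chunk: {<NN.?>*<VBD.?>*<JJ.?>*<CC>?}"
cp = nltk.RegexpParser(grammar)
print(cp.parse(tags))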
Code:
import nltk
nltk.download('punkt')
from nltk import word_tokenize, pos_tag
nltk.download('averaged_perceptron_tagger')
sentence = "Book the ticket"
sentence_tokens = word_tokenize(sentence)
print(sentence_tokens)
pos_tag(sentence_tokens)
Output
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Unzipping tokenizers/punkt.zip.
True
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data] Unzipping taggers/averaged_perceptron_tagger.zip.
True
['Book', 'the', 'ticket']
Chunking - making word phrases
Code:
import nltk
text = "The clean data is important for application development."
tokens = nltk.word_tokenize(text)
print(tokens)
tag = nltk.pos_tag(tokens)
print(tag)
grammar = "NP: {<DT>?<JJ>*<NN>}"
cp =nltk.RegexpParser(grammar)
result = cp.parse(tag)
print(result)
Output:
(S
  (NP The/DT clean/JJ data/NN)
  is/VBZ
  important/JJ
  for/IN
  (NP application/NN)
  (NP development/NN)
  ./.)
Screenshot:
Outcome: - After the practical, the student has understood and implemented different POS
taggers and performed POS tagging on the text.
Experiment No.5
Aim: Write a program to implement the N-gram model for the given text input.
Theory: N-grams are one of the fundamental concepts every data scientist and computer
science professional must know while working with text data. In this beginner-level experiment,
we will learn what n-grams are and explore them on text data in Python. The objective is to
analyze different types of n-grams on the given text data and hence decide which n-gram works
best for our data.
An N-gram model predicts the most probable word that might follow a given sequence of
words. It is a probabilistic model that is trained on a corpus of text. Such a model is useful in
many NLP applications including speech recognition, machine translation and predictive text
input. An N-gram model is built by counting how often word sequences occur in corpus text
and then estimating the probabilities. Since a simple N-gram model has limitations,
improvements are often made via smoothing, interpolation and backoff. An N-gram model is
one type of Language Model (LM), which is concerned with finding the probability distribution
over word sequences.
Code:
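The code cell for this experiment is blank in the manual; a minimal sketch that generates unigrams, bigrams and trigrams with nltk.util.ngrams (the sample sentence is an assumption):
import nltk
nltk.download('punkt')
from nltk.util import ngrams

# Assumed sample input text
text = "I love natural language processing"
tokens = nltk.word_tokenize(text)

# Generate and print n-grams for n = 1 (unigrams), 2 (bigrams), 3 (trigrams)
for n in (1, 2, 3):
    print(list(ngrams(tokens, n)))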
Screenshot:
Outcome: - After the practical, the student has understood and implemented the N-gram model
for the given text input.
Experiment No. 06
Aim: Write a program for exploratory data analysis on text (Word Cloud).
Objective: To make the student understand exploratory data analysis on text (Word Cloud)
Theory:
Exploratory Data Analysis is the process of exploring data, generating insights, testing
hypotheses, checking assumptions and revealing underlying hidden patterns in the data.
There are no shortcuts in a machine learning project lifecycle. We can’t simply skip to the
model building stage after gathering the data. We need to plan our approach in a structured
manner, and the exploratory data analysis (EDA) stage plays a huge part in that.
We need to perform investigative and detective analysis of our data to see if we can unearth
any insights.
And there’s no shortage of text data, is there? We have data being generated from tweets, digital
media platforms, blogs, and a whole host of other sources. As a data scientist and an NLP
enthusiast, it’s important to analyze all this text data to help your organization make data-driven
decisions.
Code:
import numpy as np
import pandas as pd
from google.colab import files
import matplotlib.pyplot as plt
import seaborn as sns
import string
from wordcloud import WordCloud
upload = files.upload()
for fn in upload.keys():
    print('User uploaded file "{name}" with length {length} bytes'.format(name=fn, length=len(upload[fn])))
upload
import io
Reviews_df = pd.read_csv(io.StringIO(upload['Reviews.csv'].decode('utf-8')))
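The manual stops after loading Reviews.csv and does not show the word cloud itself; a minimal sketch, where the column name 'Text' is an assumption about the uploaded file:
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

# 'Text' is an assumed column name in Reviews.csv
all_text = " ".join(str(t) for t in Reviews_df['Text'])
wc = WordCloud(width=800, height=400, stopwords=STOPWORDS, background_color="white").generate(all_text)

plt.figure(figsize=(10, 5))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()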
Outcome: - After the practical, the student has understood and implemented exploratory data
analysis on text (Word Cloud).
Experiment No. 07
Aim: Study of WordNet and implementation of the Lesk algorithm for word sense disambiguation.
Theory:
WordNet is a lexical database for the English language, which was created by Princeton, and
is part of the NLTK corpus. It is a machine-readable database of words which can be accessed
from most popular programming languages (C, C#, Java, Ruby, Python etc.). WordNet
superficially resembles a thesaurus, in that it groups words together based on their meanings.
WordNet is not like your traditional dictionary. WordNet focuses on the relationships between
words along with their definitions, and this makes WordNet a network instead of a list. NLTK
includes the English WordNet, with 155,287 words and 117,659 synonym sets.
In the WordNet network, the words are connected by linguistic relations. These linguistic
relations (hypernym, hyponym, meronym, holonym and other fancy sounding stuff), are
WordNet’s secret sauce. They give you powerful capabilities that are missing in an ordinary
dictionary/thesaurus.
1) Synonyms
WordNet stores synonyms in the form of synsets, where each word in the synset shares
the same meaning. Basically, each synset is a group of synonyms. Each synset has a
definition associated with it, and relations are stored between different synsets. In the
following example, take the word ‘sofa’: we have only one synset for ‘sofa’, which
means that it has only one context or meaning. Another word like ‘jupiter’ will give
two synsets because it has two meanings: one as ‘planet’ and the other as ‘Roman
God’.
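The code that produces the synsets shown below is not included in the manual; a minimal sketch, assuming NLTK's WordNet corpus has been downloaded:
import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet as wn

# Each synset returned for 'jupiter' represents one sense of the word
syns = wn.synsets('jupiter')
print(syns)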
[Synset('jupiter.n.01'), Synset('jupiter.n.02')]
syns[0].definition()
‘the largest planet and the 5th from the sun; has many satellites and is one of the brightest
objects in the night sky’
syns[1].definition()
‘(Roman mythology) supreme god of Romans; counterpart of Greek Zeus’
2) Hyponyms and Hypernyms
Hyponyms and hypernyms are specific and generalized concepts, respectively. For
example, ‘beach house’ and ‘guest house’ are hyponyms of ‘house’: they are more
specific concepts of ‘house’. And ‘house’ is a hypernym of ‘guest house’ because
it is the more general concept. ‘Egg noodle’ is a hyponym of ‘noodle’, and ‘pasta’ is a
hypernym of ‘noodle’.
wn.synset('noodle.n.01').hyponyms()
[Synset('egg_noodle.n.01')]
wn.synset('noodle.n.01').hypernyms()
[Synset('pasta.n.02')]
wn.synset('egg_noodle.n.01').definition()
‘narrow strip of pasta dough made with eggs’
wn.synset('pasta.n.01').definition()
‘a dish that contains pasta as its main ingredient’
3) Meronyms and Holonyms
Meronyms and Holonyms represent the part-whole relationship. The meronym
represents the part and the holonym represents the whole. For example, ‘kitchen’ is
a meronym of ‘home'(the kitchen is a part of the home), ‘mattress’ is a meronym of
‘bed’, and ‘bedroom’ is a holonym of ‘bed’.
wn.synset('bed.n.01').part_holonyms()
[Synset('bedroom.n.01')]
wn.synset('bed.n.01').part_meronyms()
[Synset('bedstead.n.01'), Synset('mattress.n.01')]
4) Word Similarity
We can compute the similarity between two words based on the distance between
words in the WordNet network. The smaller the distance, the more similar the
words. In this way, it is possible to quantitatively figure out that a cat and a dog are
similar, a phone and a computer are similar, but a cat and a phone are not similar!
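The snippets below measure string similarity with edit distance and difflib; the WordNet-based similarity described in this paragraph can be sketched as follows (the chosen synsets are assumptions for illustration):
import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet as wn

dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')
phone = wn.synset('telephone.n.01')

# Path similarity is higher for synsets that sit closer together in the WordNet hierarchy
print(dog.path_similarity(cat))    # cat and dog: relatively similar
print(cat.path_similarity(phone))  # cat and phone: much less similar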
import nltk
nltk.edit_distance("humpty", "dumpty")
Output: 1
import difflib
seq = difflib.SequenceMatcher(None, a, b)
d = seq.ratio()*100
print(d)
Output: 87.32394366197182
import difflib
a = 'phone'
b = 'computer'
seq = difflib.SequenceMatcher(None, a, b)
d = seq.ratio()*100
print(d)
Output: 30.76923076923077
Lesk algorithm
Consider three examples of the distinct senses that exist for the word "bass":
1. a type of fish
2. tones of low frequency
3. a type of instrument
and the sentences:
1. I went fishing for some sea bass.
2. The bass line of the song is too weak.
To a human, it is obvious that the first sentence is using the word "bass (fish)", as in
the first sense above, and that in the second sentence the word "bass (instrument)"
is being used, as in the latter senses. Developing algorithms to replicate this human
ability can often be a difficult task.
In the above example, for the first and the second sentence the Lesk algorithm is
somewhat accurate in understanding the context of the word "bass" in the sentence.
But for a sentence where "bass" appears in the context of a musical instrument, it
estimates the word as Synset('sea_bass.n.01'), which is clearly not correct!
Unfortunately, Lesk’s approach is very sensitive to the exact wording of definitions,
so the absence of a certain word can radically change the results.
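The Lesk code itself is not shown in the manual; a minimal sketch using NLTK's built-in implementation (nltk.wsd.lesk) on the two example sentences above:
import nltk
nltk.download('punkt')
nltk.download('wordnet')
from nltk.wsd import lesk
from nltk.tokenize import word_tokenize

sent1 = "I went fishing for some sea bass."
sent2 = "The bass line of the song is too weak."

# lesk() returns the synset whose gloss overlaps most with the context words
print(lesk(word_tokenize(sent1), 'bass'))
print(lesk(word_tokenize(sent2), 'bass'))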
Outcome: - After the practical, the student has understood and implemented the Lesk algorithm.
Experiment No. 08
Aim: CASE STUDY: Application of NLP-Sentiment Analysis of Real Comments
in Social Media platform
Theory: A Twitter sentiment analysis determines negative, positive, or neutral emotions within the
text of a tweet using NLP and ML models. Sentiment analysis or opinion mining refers to
identifying as well as classifying the sentiments that are expressed in the text source. Tweets are
often useful in generating a vast amount of sentiment data upon analysis. These data are useful in
understanding the opinion of people on social media for a variety of topics.
Practical:
!pip install -q transformers
from transformers import pipeline
sentiment_pipeline = pipeline("sentiment-analysis")
data = ["I Love You", "I hate you"]
sentiment_pipeline(data)
specific_model = pipeline(model="finiteautomata/bertweet-base-sentiment-analysis")
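The manual creates specific_model but does not show it being applied; a minimal usage sketch on the same example data:
# Run the tweet-specific sentiment model on the same two example sentences
specific_model(data)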
Screenshot:
Outcome: After the practical, the student has understood and implemented sentiment analysis of tweets on a
social media platform.