NLP Lab Manual (R20)
KRISHNACHAITANYA INSTITUTE OF TECHNOLOGY & SCIENCES
MARKAPUR, PRAKASAM DIST. (A.P.)
(Affiliated to J.N.T. University, Kakinada)
CERTIFICATE
Certified that this is a bonafide record of Practical Work done by
Mr./Miss……………………………………………………………………………
Examiner -1 Examiner -2
Experiment – 1
Demonstrate noise removal for textual data (remove URLs, usernames and hashtags)
Code:
import re

def remove_noise(text):
    # Remove URLs
    text = re.sub(r"http\S+|www\.\S+", "", text)
    # Remove usernames
    text = re.sub(r"@\w+", "", text)
    # Remove hashtags
    text = re.sub(r"#\w+", "", text)
    # Collapse the extra whitespace left behind
    return re.sub(r"\s+", " ", text).strip()

# Example text
text = "Just had the best coffee from @Starbucks! #coffee #yum 😍 http://starbucks.com"

# Remove noise
clean_text = remove_noise(text)
print(clean_text)
OUTPUT:
Experiment – 2
Perform lemmatization and stemming using the Python library NLTK
Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope
of reducing related word forms to a common base correctly most of the time, and it often includes
the removal of derivational affixes.
Code:
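A minimal sketch that reproduces the comparison shown in the output below, assuming NLTK's PorterStemmer and LancasterStemmer and the four sample words from that output:

from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()
words = ["cats", "trouble", "troubling", "troubled"]

# Stem each word with both stemmers and print the results side by side
print("Porter Stemmer")
for word in words:
    print(word, "=>", porter.stem(word))

print("Lancaster Stemmer")
for word in words:
    print(word, "=>", lancaster.stem(word))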
OUTPUT:
Porter Stemmer
cats => cat
trouble => troubl
troubling => troubl
troubled => troubl

Lancaster Stemmer
cats => cat
trouble => troubl
troubling => troubl
troubled => troubl
Code:
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

def find(text):
    # Helper (inferred from the output below): stem every token with the Porter stemmer
    stemmer = PorterStemmer()
    return " ".join(stemmer.stem(word) for word in word_tokenize(text))

sentence = "Pythoners are very intelligent and work very pythonly and now they are pythoning their way to success."
x = find(sentence)
print(x)
Output:
python are veri intellig and work veri pythonli and now they are python their way to success .
Lemmatization
Lemmatization usually refers to doing things properly with the use of a vocabulary and
morphological analysis of words, normally aiming to remove inflectional endings only and to
return the base or dictionary form of a word, which is known as the lemma. If confronted with
the token saw, stemming might return just s, whereas lemmatization would attempt to return
either see or saw depending on whether the use of the token was as a verb or a noun.
For example, runs, running, ran are all forms of the word run, therefore run is the lemma of all
these words. Because lemmatization returns an actual word of the language, it is used where it is
necessary to get valid words.
Python NLTK provides a WordNet Lemmatizer that uses the WordNet database to look up the lemmas
of words.
Code:
import nltk
from nltk.stem import WordNetLemmatizer

wordnet_lemmatizer = WordNetLemmatizer()
sentence = "He was running and eating at same time. He has bad habit of swimming after playing long hours in the Sun."
punctuations = "?:!.,;"
sentence_words = nltk.word_tokenize(sentence)

# Drop punctuation tokens before lemmatizing
sentence_words = [word for word in sentence_words if word not in punctuations]

print("{0:20}{1:20}".format("Word", "Lemma"))
for word in sentence_words:
    print("{0:20}{1:20}".format(word, wordnet_lemmatizer.lemmatize(word)))
Output:
Word                Lemma
He                  He
was                 wa
running             running
and                 and
eating              eating
at                  at
same                same
time                time
He                  He
has                 ha
bad                 bad
habit               habit
of                  of
swimming            swimming
after               after
playing             playing
long                long
hours               hour
in                  in
the                 the
Sun                 Sun
In the output above, notice that hardly any word is reduced to an actual root form; this is because
the words are lemmatized without context.
You need to provide the context in which you want to lemmatize, that is, the part of speech (POS).
This is done by passing a value for the pos parameter to wordnet_lemmatizer.lemmatize, as shown below.
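For example, a small illustration re-using the same WordNetLemmatizer as above; pos="v" tells the lemmatizer to treat the token as a verb:

from nltk.stem import WordNetLemmatizer

wordnet_lemmatizer = WordNetLemmatizer()
# Without context the default POS is noun, so verb forms pass through unchanged
print(wordnet_lemmatizer.lemmatize("running"))           # running
# With pos="v" the token is treated as a verb and reduced to its base form
print(wordnet_lemmatizer.lemmatize("running", pos="v"))  # run
print(wordnet_lemmatizer.lemmatize("was", pos="v"))      # be
print(wordnet_lemmatizer.lemmatize("has", pos="v"))      # have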
Experiment – 3
Demonstrate object standardization, such as replacing social media slang in a
text.
Code:
"tq":"thankyou","vry":"very","yt":"youtube","fb":"facebook",
"insta":"instagram","u":"you","tmrw":"tommorow","snap":"snapchat",
"wlcm":"welcome","uncntble":"uncountable","bday":"birthday"}
def _lookup_words(input_text):
words = input_text.split()
new_words = []
if word.lower() in lookup_dict:
word = lookup_dict[word.lower()]
new_words.append(word)
return new_text
aatweet= "rt from aa for uncntble wishes from fans all over the world on his bday!!\n <3 <3 <3
\n \" hlo everyone!! \n I had got so much luv from you all!! \n tq for all your awsm luv and
affection \n i am soo happy to get great luv from u all , \n yours lovingly aa !!"
print(aatweet)
print(_lookup_words(aatweet))
own = input("enter your own message to convert it into formal language: ")
print(_lookup_words(own))
OUTPUT:
rt from aa for uncntble wishes from fans all over the world on his bday!!
yours lovingly aa !!
Retweet from allu arjun for uncountable wishes from fans all over the world on his bday!! ♡
♡ ♡ " hello everyone!! I had got so much love from you all!! thankyou for all your
awesome love and affection i am soo happy to get great love from you all , yours lovingly allu
arjun !!
Experiment – 4
Perform part-of-speech tagging on any textual data
Code:
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

stop_words = set(stopwords.words('english'))

# Dummy text
txt = "Sukanya, Rajib and Naba are my good friends. " \
      "Sukanya is getting married next year. " \
      "Marriage is a big step in one’s life. " \
      "It is both exciting and frightening. " \
      "But friendship is a sacred bond between people. " \
      "It is a special kind of love between us. " \
      "Many of you must have tried searching for a friend " \
      "but never found the right one."

# Tokenize into sentences, drop stop words and punctuation (filter inferred from the output below),
# then POS-tag the remaining tokens of each sentence
for sentence in sent_tokenize(txt):
    words = [word for word in word_tokenize(sentence)
             if word not in stop_words and word not in string.punctuation]
    print(nltk.pos_tag(words))
Output:
[('Sukanya', 'NNP'), ('Rajib', 'NNP'), ('Naba', 'NNP'), ('good', 'JJ'), ('friends', 'NNS')]
[('Sukanya', 'NNP'), ('getting', 'VBG'), ('married', 'VBN'), ('next', 'JJ'), ('year', 'NN')]
[('Marriage', 'NN'), ('big', 'JJ'), ('step', 'NN'), ('one', 'CD'), ('’', 'NN'), ('life', 'NN')]
[('It', 'PRP'), ('exciting', 'VBG'), ('frightening', 'VBG')]
[('But', 'CC'), ('friendship', 'NN'), ('sacred', 'VBD'), ('bond', 'NN'), ('people', 'NNS')]
[('It', 'PRP'), ('special', 'JJ'), ('kind', 'NN'), ('love', 'VB'), ('us', 'PRP')]
[('Many', 'JJ'), ('must', 'MD'), ('tried', 'VB'), ('searching', 'VBG'), ('friend', 'NN'),
('never', 'RB'), ('found', 'VBD'), ('right', 'RB'), ('one', 'CD')]
CC coordinating conjunction
CD cardinal digit
DT determiner
EX existential there (like: “there is” … think of it like “there exists”)
FW foreign word
IN preposition/subordinating conjunction
JJ adjective – ‘big’
JJR adjective, comparative – ‘bigger’
JJS adjective, superlative – ‘biggest’
LS list marker 1)
MD modal – could, will
NN noun, singular – ‘desk’
NNS noun plural – ‘desks’
NNP proper noun, singular – ‘Harrison’
NNPS proper noun, plural – ‘Americans’
PDT predeterminer – ‘all the kids’
POS possessive ending parent’s
PRP personal pronoun – I, he, she
PRP$ possessive pronoun – my, his, hers
RB adverb – very, silently,
RBR adverb, comparative – better
RBS adverb, superlative – best
RP particle – give up
TO – to go ‘to’ the store.
UH interjection – errrrrrrrm
VB verb, base form – take
VBD verb, past tense – took
VBG verb, gerund/present participle – taking
VBN verb, past participle – taken
VBP verb, sing. present, non-3d – take
VBZ verb, 3rd person sing. present – takes
WDT wh-determiner – which
WP wh-pronoun – who, what
WP$ possessive wh-pronoun, eg- whose
WRB wh-adverb, eg- where, when
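As a quick illustration of how a few of these tags show up in practice (the sentence here is only an example, not from the experiment above):

import nltk
from nltk.tokenize import word_tokenize

# Tags such as DT, JJ, NN, VBD and IN from the list above appear in the result
print(nltk.pos_tag(word_tokenize("The quick brown fox jumped over the lazy dog")))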
Experiment – 5
Implement topic modeling using Latent Dirichlet Allocation (LDA) in Python.
Code:
import gensim
from gensim import corpora

# Example corpus (the documents and stop-word list here are illustrative, assumed for this sketch)
documents = ["Rafael Nadal joins Roger Federer in missing the US Open",
             "Rafael Nadal is out of the Australian Open",
             "Biden says he is not worried about the election forecasts"]
stopwords = set("for a of the and to in is not he about says".split())

# Preprocessing
texts = [[word for word in document.lower().split() if word not in stopwords] for document in
         documents]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Results: train a two-topic LDA model and print the discovered topics
lda_model = gensim.models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=15)
for topic in lda_model.print_topics():
    print(topic)
OUTPUT:
Experiment – 6
Demonstrate Term Frequency – Inverse Document Frequency (TF-IDF)
using Python
Code:
from sklearn.feature_extraction.text import TfidfVectorizer

# Example corpus (illustrative documents, assumed for this sketch)
documents = ["hello world", "hello machine learning", "goodbye cruel world"]

tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)
feature_names = tfidf_vectorizer.get_feature_names_out()
idf_scores = tfidf_vectorizer.idf_

# Printing results: every non-zero term of every document with its TF-IDF, IDF and TF scores
for i in range(tfidf_matrix.shape[0]):
    for j in tfidf_matrix[i].nonzero()[1]:
        tfidf_score = tfidf_matrix[i, j]
        idf_score = idf_scores[j]
        tf_score = tfidf_score / idf_score
        print(f"{feature_names[j]}: TF-IDF={tfidf_score:.3f}, IDF={idf_score:.3f}, TF={tf_score:.3f}")
OUTPUT:
hello: TF-IDF=0.707, IDF=1.099, TF=0.643
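Note how the three reported numbers are related exactly as in the code above: TF = TF-IDF / IDF, i.e. 0.707 / 1.099 ≈ 0.643.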
Experiment – 7
Demonstrate word embeddings using word2vec
Code:
# Python program to generate word vectors using Word2Vec
# importing all necessary modules
from nltk.tokenize import sent_tokenize, word_tokenize
import warnings
warnings.filterwarnings(action='ignore')
import gensim
from gensim.models import Word2Vec

# Reading the corpus (assumed here to be a plain-text file named alice.txt containing the words used below)
sample = open("alice.txt", "r", encoding="utf-8")
f = sample.read().replace("\n", " ")

# Tokenizing the corpus into a list of sentences, each a list of lowercase words
data = [[word.lower() for word in word_tokenize(sent)] for sent in sent_tokenize(f)]

# Creating the CBOW model
model1 = Word2Vec(data, min_count=1, vector_size=100, window=5)

# Print results
print("Cosine similarity between 'alice' " + "and 'wonderland' - CBOW : ",
      model1.wv.similarity('alice', 'wonderland'))
print("Cosine similarity between 'alice' " + "and 'machines' - CBOW : ",
      model1.wv.similarity('alice', 'machines'))

# Creating the Skip Gram model
model2 = Word2Vec(data, min_count=1, vector_size=100, window=5, sg=1)

# Print results
print("Cosine similarity between 'alice' " + "and 'wonderland' - Skip Gram : ",
      model2.wv.similarity('alice', 'wonderland'))
print("Cosine similarity between 'alice' " + "and 'machines' - Skip Gram : ",
      model2.wv.similarity('alice', 'machines'))
Output:
Cosine similarity between 'alice' and 'wonderland' - CBOW : 0.999249298413
Cosine similarity between 'alice' and 'machines' - CBOW : 0.974911910445
Cosine similarity between 'alice' and 'wonderland' - Skip Gram : 0.885471373104
Cosine similarity between 'alice' and 'machines' - Skip Gram : 0.856892599521
Experiment – 8
Implement text classification using the Naïve Bayes classifier and the
TextBlob library
Code:
from textblob.classifiers import NaiveBayesClassifier

# Training data (illustrative sentence/label pairs, assumed for this sketch)
train_data = [
    ("I love this movie", "pos"),
    ("This film is great", "pos"),
    ("I hate this movie", "neg"),
    ("This film is terrible", "neg"),
]
classifier = NaiveBayesClassifier(train_data)

# Testing data
test_data = ["The acting was great", "The plot was terrible"]
for data in test_data:
    result = classifier.classify(data)
    print(f"{data}: {result}")
OUTPUT:
Experiment – 9
Apply support vector machine for text classification.
Code:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Example corpus and labels (illustrative, assumed for this sketch)
corpus = [
    "I love this movie", "This film is great", "An excellent and enjoyable film",
    "I hate this movie", "This film is terrible", "A boring and awful movie",
]
labels = ["pos", "pos", "pos", "neg", "neg", "neg"]

# Splitting the corpus into training and testing data
X_train, X_test, y_train, y_test = train_test_split(corpus, labels, test_size=0.3, random_state=42)

# Creating TfidfVectorizer object and transforming the training and testing data
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)

# Creating a SVM classifier object and fitting the training data
svm = SVC(kernel='linear')
svm.fit(X_train, y_train)
y_pred = svm.predict(X_test)
print(classification_report(y_test, y_pred))
OUTPUT:
accuracy 1.00 2
Experiment – 10
Convert text to vectors (using term frequency) and apply cosine similarity to
measure the closeness between two texts.
Code:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Example text (the four sample sentences are illustrative, assumed for this sketch)
texts = ["Natural language processing is fun",              # text1
         "I enjoy natural language processing",             # text2
         "The weather is nice today",                       # text3
         "I really enjoy natural language processing"]      # text4

# Creating a CountVectorizer object and transforming the texts into feature vectors
vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(texts)

# Calculating cosine similarity between text2 and text4
similarity = cosine_similarity(vectors[1], vectors[3])
print("Cosine similarity between text2 and text4:", similarity[0][0])
OUTPUT: