50% found this document useful (2 votes)

2K views24 pages

NLP Lab Manual (R20)

Uploaded by

Gopi Naveen

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

50% found this document useful (2 votes)

2K views24 pages

NLP Lab Manual (R20)

Uploaded by

Gopi Naveen

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 24

NATURAL LANGUAGE PROCESSING WITH PYTHON LAB ROLL NO:

KRISHNACHAITANYA INSTITUTE OF
TECHNOLOGY& SCIENCES
MARKAPUR,PRAKASAM-DIST.(A.P)
(Affiliated to J.N.T.University,Kakinada)

CERTIFICATE
Certified that this is a bonafied record of Practical Work done by

Mr./Miss……………………………………………………………………………

Roll No. ……………………. Of..……. Year B.tech ......................... Semester in

………………………………….Laboratory during the academic year 2022-2023.

Staff Member In charge Head of the Department

Submitted for the practical examination held on …………………

Examiner -1 Examiner -2

KRISHNACHAITANYA INSTITUTE OF TECHNOLOGY & SCIENCES::MARKAPUR

INDEX
S. Date List of Page Remarks
No. Experiments No.
1 Demonstrate Noise Removal for any textual data and remove
regular expression pattern such as hash tag from textual data. 1

2 Perform lemmatization and stemming using python library nltk.

2-5

3 Demonstrate object standardization such as replace social media

slangs from a text. 6-7

4 Perform part of speech tagging on any textual data.

8-9

5 Implement topic modeling using Latent Dirichlet Allocation (LDA )

in python. 10-11

6 Demonstrate Term Frequency – Inverse Document Frequency

(TF – IDF) using python 12-14

7 Demonstrate word embedding’s using word2vec.

15-16

8 Implement Text classification using naïve bayes classifier and

text blob library. 17-18

9 Apply support vector machine for text classification

19-20

10 Convert text to vectors (using term frequency) and apply cosine

similarity to provide closeness among two text. 21-22
Experiment - 1
Demonstrate Noise Removal for any textual data and remove regular
expression pattern such as hash tag from textual data
Code:

import re

def remove_noise(text):

# Remove URLs

text = re.sub(r'http\S+', '', text)

# Remove usernames

text = re.sub(r'@\S+', '', text)

# Remove hashtags

text = re.sub(r'#\S+', '', text)

# Remove punctuation and other non-alphanumeric characters

text = re.sub(r'[^\w\s]', '', text)

# Remove extra whitespace

text = re.sub(r'\s+', ' ', text).strip()

return text

# Example text

text = "Just had the best coffee from @Starbucks! #coffee #yum 😍 http://starbucks.com"

# Remove noise

clean_text = remove_noise(text)

print(clean_text)

OUTPUT:

Just had the best coffee from!

Page 1
Experiment – 2
Perform lemmatization and stemming using python library nltk
Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope
of achieving this goal correctly most of the time, and often includes the removal of derivational
affixes.

Code:

fromnltk.stem import PorterStemmer

fromnltk.stem import LancasterStemmer
#create an object of class PorterStemmer
porter = PorterStemmer()
lancaster=LancasterStemmer()
#provide a word to be stemmed
print("Porter Stemmer")
print("cats => ",porter.stem("cats"))
print("trouble => ",porter.stem("trouble"))
print("troubling =>", porter.stem("troubling"))
print("troubled => ",porter.stem("troubled"))
print("Lancaster Stemmer")
print("cats => ",lancaster.stem("cats"))
print("trouble => ",lancaster.stem("trouble"))
print("troubling =>",lancaster.stem("troubling"))
print("troubled => ",lancaster.stem("troubled"))

OUTPUT:

Porter Stemmer
cats => cat
trouble =>troubl
troubling =>troubl
troubled =>troubl
Page 2
Lancaster Stemmer
cats => cat
trouble =>troubl
troubling =>troubl
troubled =>troubl

Stemming a Complete Sentence

Code:

fromnltk.tokenize import sent_tokenize, word_tokenize

fromnltk.stem import PorterStemmer
#from nltk.stem import LancasterStemmer
porter = PorterStemmer()
def find(sentence):
token_words=word_tokenize(sentence)
print(token_words)
stem_sentence=[]
for word in token_words:
stem_sentence.append(porter.stem(word))
stem_sentence.append(" ")
return "".join(stem_sentence)

sentence="Pythoners are very intelligent and work very pythonly and now they are python
ing their way to success."
x=find(sentence)
print(x)

Output:
python are veriintellig and work veripythonli and now they are python their way to success .

Lemmatization
Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope
of achieving this goal correctly most of the time, and often includes the removal of derivational
affixes.

Lemmatization usually refers to doing things properly with the use of a vocabulary and
morphological analysis of words, normally aiming to remove inflectional endings only and to
return the base or dictionary form of a word, which is known as the lemma . If confronted with
the token saw, stemming might return just s, whereas lemmatization would attempt to return
either see or saw depending on whether the use of the token was as a verb or a noun.

Page 3
For example, runs, running, ran are all forms of the word run, therefore run is the lemma of all
these words. Because lemmatization returns an actual word of the language, it is used where it is
necessary to get valid words.

Python NLTK provides WordNet Lemmatizer that uses the WordNet Database to lookup lemmas
of words.

Code:

fromnltk.stemimportWordNetLemmatizer
wordnet_lemmatizer=WordNetLemmatizer()

sentence="He was running and eating at same time. He has bad habit of swimming after playing
long hours in the Sun."
punctuations="?:!.,;"
sentence_words=nltk.word_tokenize(sentence)
forwordinsentence_words:
ifwordinpunctuations:
sentence_words.remove(word)

sentence_words
print("{0:20}{1:20}".format("Word","Lemma"))
forwordinsentence_words:print("{0:20}{1:20}".format(word,wordnet_lemmatizer.lemmatize(wo
rd)))

Output:

Word Lemma
He He
waswa
runningrunning
andand
eatingeating
atat
samesame
timetime
He He
has ha
badbad
habithabit
ofof
swimmingswimming
afterafter
playingplaying
longlong
hours hour

Page 4
in in
thethe
Sun Sun

In the above code output, you must be wondering that no actual root form has been given for any
word, this is because they are given without context.
You need to provide the context in which you want to lemmatize that is the parts-of-speech
(POS). This is done by giving the value for pos parameter in wordnet_lemmatizer.lemmatize.

Page 5
Experiment – 3
Demonstrate object standardization such as replace social media slangs from
a text.
Code:

lookup_dict = {'rt':'Retweet', 'dm':'direct message', "awsm" : "awesome",

"luv" :"love","hlo":"hello","<3":"â™¡","aa":"allu arjun","ths":"this",

"tq":"thankyou","vry":"very","yt":"youtube","fb":"facebook",

"insta":"instagram","u":"you","tmrw":"tommorow","snap":"snapchat",

"gn":"goodnight","gm":"good morning","ga":"good afternoon",

"wlcm":"welcome","uncntble":"uncountable","bday":"birthday"}

def _lookup_words(input_text):

words = input_text.split()

new_words = []

for word in words:

if word.lower() in lookup_dict:

word = lookup_dict[word.lower()]

new_words.append(word)

new_text = " ".join(new_words)

return new_text

aatweet= "rt from aa for uncntble wishes from fans all over the world on his bday!!\n <3 <3 <3
\n \" hlo everyone!! \n I had got so much luv from you all!! \n tq for all your awsm luv and
affection \n i am soo happy to get great luv from u all , \n yours lovingly aa !!"

print(aatweet)

print("THE CONVERTED MESSAGE IS AS FOLLOWS >>:")

print(_lookup_words(aatweet))

Page 6
own=input("enter your own message language to convert it into formal:")

print(_lookup_words(own))

OUTPUT :

rt from aa for uncntble wishes from fans all over the world on his bday!!

<3 <3 <3

" hlo everyone!!

I had got so much luv from you all!!

tq for all your awsm luv and affection

i am soo happy to get great luv from u all ,

yours lovingly aa !!

THE CONVERTED MESSAGE IS AS FOLLOWS >>:

Retweet from allu arjun for uncountable wishes from fans all over the world on his bday!! â™¡
â™¡ â™¡ " hello everyone!! I had got so much love from you all!! thankyou for all your
awesome love and affection i am soo happy to get great love from you all , yours lovingly allu
arjun !!

Page 7
Experiment – 4
Perform part of speech tagging on any textual data
Code:
importnltk
fromnltk.corpus import stopwords
fromnltk.tokenize import word_tokenize, sent_tokenize
stop_words = set(stopwords.words('english'))

#Dummy text
txt = "Sukanya, Rajib and Naba are my good friends. " \
"Sukanya is getting married next year. " \
"Marriage is a big step in one’s life." \
"It is both exciting and frightening. " \
"But friendship is a sacred bond between people." \
"It is a special kind of love between us. " \
"Many of you must have tried searching for a friend "\
"but never found the right one."

# sent_tokenize is one of instances of

# PunktSentenceTokenizer from the nltk.tokenize.punkt module
tokenized = sent_tokenize(txt)
fori in tokenized:

# Word tokenizers is used to find the words

# and punctuation in a string
wordsList = nltk.word_tokenize(i)

# removing stop words from wordList

wordsList = [w for w in wordsList if not w instop_words]
# Using a Tagger. Which is part-of-speech
# tagger or POS-tagger.
tagged = nltk.pos_tag(wordsList)
print(tagged)

Output:

[('Sukanya', 'NNP'), ('Rajib', 'NNP'), ('Naba', 'NNP'), ('good', 'JJ'), ('friends', 'NNS')]

Page 8
[('Sukanya', 'NNP'), ('getting', 'VBG'), ('married', 'VBN'), ('next', 'JJ'), ('year', 'NN')]
[('Marriage', 'NN'), ('big', 'JJ'), ('step', 'NN'), ('one', 'CD'), ('’', 'NN'), ('life', 'NN')]
[('It', 'PRP'), ('exciting', 'VBG'), ('frightening', 'VBG')]
[('But', 'CC'), ('friendship', 'NN'), ('sacred', 'VBD'), ('bond', 'NN'), ('people', 'NNS')]
[('It', 'PRP'), ('special', 'JJ'), ('kind', 'NN'), ('love', 'VB'), ('us', 'PRP')]
[('Many', 'JJ'), ('must', 'MD'), ('tried', 'VB'), ('searching', 'VBG'), ('friend', 'NN'),
('never', 'RB'), ('found', 'VBD'), ('right', 'RB'), ('one', 'CD')]

CC coordinating conjunction
CD cardinal digit
DT determiner
EX existential there (like: “there is” … think of it like “there exists”)
FW foreign word
IN preposition/subordinating conjunction
JJ adjective – ‘big’
JJR adjective, comparative – ‘bigger’
JJS adjective, superlative – ‘biggest’
LS list marker 1)
MD modal – could, will
NN noun, singular ‘- desk’
NNS noun plural – ‘desks’
NNP proper noun, singular – ‘Harrison’
NNPS proper noun, plural – ‘Americans’
PDT predeterminer – ‘all the kids’
POS possessive ending parent’s
PRP personal pronoun – I, he, she
PRP$ possessive pronoun – my, his, hers
RB adverb – very, silently,
RBR adverb, comparative – better
RBS adverb, superlative – best
RP particle – give up
TO – to go ‘to’ the store.
UH interjection – errrrrrrrm
VB verb, base form – take
VBD verb, past tense – took
VBG verb, gerund/present participle – taking
VBN verb, past participle – taken
VBP verb, sing. present, non-3d – take
VBZ verb, 3rd person sing. present – takes
WDT wh-determiner – which
WP wh-pronoun – who, what
WP$ possessive wh-pronoun, eg- whose
WRB wh-adverb, eg- where, when

Page 9
Experiment – 5
Implement topic modeling using Latent Dirichlet Allocation (LDA ) in python.
Code:

import gensim

from gensim import corpora

# Example corpus

doc1 = "hello world"

doc2 = "world news"

doc3 = "news update"

doc4 = "world update"

doc5 = "hello update"

documents = [doc1, doc2, doc3, doc4, doc5]

# Preprocessing

stopwords = set('for a of the and to in'.split())

texts = [[word for word in document.lower().split() if word not in stopwords] for document in
documents]

# Creating dictionary and corpus

dictionary = corpora.Dictionary(texts)

corpus = [dictionary.doc2bow(text) for text in texts]

# LDA model training

lda_model = gensim.models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,

passes=10)

# Results

for topic_id, topic in lda_model.show_topics(num_topics=2, num_words=3):

print(f"Topic {topic_id+1}: {topic}")

Page 10
OUTPUT:

Topic 1: 0.263"world" + 0.256"hello" + 0.252*"update"

Topic 2: 0.387"news" + 0.296"update" + 0.196*"world"

Page 11
Experiment – 6
Demonstrate Term Frequency – Inverse Document Frequency (TF – IDF)
using python
Code:

from sklearn.feature_extraction.text import TfidfVectorizer

# Example corpus

doc1 = "hello world"

doc2 = "world news"

doc3 = "news update"

doc4 = "world update"

doc5 = "hello update"

documents = [doc1, doc2, doc3, doc4, doc5]

# Creating TfidfVectorizer object

tfidf_vectorizer = TfidfVectorizer()

# Fitting and transforming the corpus

tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

# Getting feature names and IDF scores

feature_names = tfidf_vectorizer.get_feature_names()

idf_scores = tfidf_vectorizer.idf_

# Printing results

for i, doc in enumerate(documents):

print(f"Document {i+1}: {doc}")

for j, word in enumerate(feature_names):

tfidf_score = tfidf_matrix[i, j]

idf_score = idf_scores[j]

Page 12
tf_score = tfidf_score / idf_score

print(f" {word}: TF-IDF={tfidf_score:.3f}, IDF={idf_score:.3f}, TF={tf_score:.3f}")

OUTPUT:

Document 1: hello world

hello: TF-IDF=0.707, IDF=1.099, TF=0.643

news: TF-IDF=0.000, IDF=1.099, TF=0.000

update: TF-IDF=0.000, IDF=1.099, TF=0.000

world: TF-IDF=0.707, IDF=1.099, TF=0.643

Document 2: world news

hello: TF-IDF=0.000, IDF=1.099, TF=0.000

news: TF-IDF=0.707, IDF=1.099, TF=0.643

update: TF-IDF=0.000, IDF=1.099, TF=0.000

world: TF-IDF=0.707, IDF=1.099, TF=0.643

Document 3: news update

hello: TF-IDF=0.000, IDF=1.099, TF=0.000

news: TF-IDF=0.707, IDF=1.099, TF=0.643

update: TF-IDF=0.707, IDF=1.099, TF=0.643

world: TF-IDF=0.000, IDF=1.099, TF=0.000

Document 4: world update

hello: TF-IDF=0.000, IDF=1.099, TF=0.000

news: TF-IDF=0.000, IDF=1.099, TF=0.000

update: TF-IDF=0.707, IDF=1.099, TF=0.643

world: TF-IDF=0.707, IDF=1.099, TF=0.643

Document 5: hello update

Page 13
hello: TF-IDF=0.707, IDF=1.099, TF=0.643

news: TF-IDF=0.000, IDF=1.099, TF=0.000

update: TF-IDF=0.707, IDF=1.099, TF=0.643

world: TF-IDF=0.000, IDF=1.099

Page 14
Experiment – 7
Demonstrate word embeddings using word2vec
Code:
# Python program to generate word vectors using Word2Vec
# importing all necessary modules
fromnltk.tokenize import sent_tokenize, word_tokenize
import warnings
warnings.filterwarnings(action = 'ignore')
importgensim
fromgensim.models import Word2Vec

# Reads ‘alice.txt’ file

sample = open("C:\\Users\\Admin\\Desktop\\alice.txt", "utf8")
s = sample.read()

# Replaces escape character with space

f = s.replace("\n", " ") data = []

# iterate through each sentence in the file

fori in sent_tokenize(f): temp = []

# tokenize the sentence into words

for j in word_tokenize(i): temp.append(j.lower()) data.append(temp)

# Create CBOW model

model1 = gensim.models.Word2Vec(data, min_count = 1, vector_size = 100, window = 5)

# Print results
print("Cosine similarity between 'alice' " + "and 'wonderland' - CBOW : ",
model1.wv.similarity('alice', 'wonderland'))

print("Cosine similarity between 'alice' " + "and 'machines' - CBOW : ",

model1.wv.similarity('alice', 'machines'))
# Create Skip Gram model
model2 = gensim.models.Word2Vec(data, min_count = 1, vector_size = 100, window = 5, sg =
1)

# Print results
print("Cosine similarity between 'alice' " + "and 'wonderland' - Skip Gram : ",
model2.wv.similarity('alice', 'wonderland'))
print("Cosine similarity between 'alice' " + "and 'machines' - Skip Gram : ",
model2.wv.similarity('alice', 'machines'))

Page 15
Output :
Cosine similarity between 'alice' and 'wonderland' - CBOW : 0.999249298413
Cosine similarity between 'alice' and 'machines' - CBOW : 0.974911910445
Cosine similarity between 'alice' and 'wonderland' - Skip Gram : 0.885471373104
Cosine similarity between 'alice' and 'machines' - Skip Gram : 0.856892599521

Page 16
Experiment – 8
Implement Text classification using naïve bayes classifier and text blob
library
Code:

from textblob import TextBlob

from textblob.classifiers import NaiveBayesClassifier

# Training data

train_data = [

("I love this product", "positive"),

("This is a great experience", "positive"),

("I hate this product", "negative"),

("I do not like this experience", "negative")

# Creating a Naive Bayes classifier object

classifier = NaiveBayesClassifier(train_data)

# Testing data

test_data = [

"I like this product",

"This is a bad experience"

# Classifying the testing data

for data in test_data:

result = classifier.classify(data)

Page 17
print(f"{data}: {result}")

OUTPUT:

I like this product: positive

This is a bad experience: negative

Page 18
Experiment – 9
Apply support vector machine for text classification.
Code:

from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.model_selection import train_test_split

from sklearn.svm import SVC

from sklearn.metrics import classification_report

# Example corpus

corpus = [

("I love this product", "positive"),

("This is a great experience", "positive"),

("I hate this product", "negative"),

("I do not like this experience", "negative")

# Splitting corpus into training and testing data

X = [c[0] for c in corpus]

y = [c[1] for c in corpus]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Creating TfidfVectorizer object and transforming the training and testing data

vectorizer = TfidfVectorizer()

X_train = vectorizer.fit_transform(X_train)

X_test = vectorizer.transform(X_test)
Page 19
# Creating a SVM classifier object and fitting the training data

svm = SVC(kernel='linear')

svm.fit(X_train, y_train)

# Predicting the testing data and evaluating the performance

y_pred = svm.predict(X_test)

print(classification_report(y_test, y_pred))

OUTPUT:

precision recall f1-score support

negative 1.00 1.00 1.00 1

positive 1.00 1.00 1.00 1

accuracy 1.00 2

macro avg 1.00 1.00 1.00 2

weighted avg 1.00 1.00 1.00 2

Page 20
Experiment – 10
Convert text to vectors (using term frequency) and apply cosine similarity to
provide closeness among two text.
Code:

from sklearn.feature_extraction.text import CountVectorizer

from sklearn.metrics.pairwise import cosine_similarity

# Example text

text1 = "I love this product"

text2 = "This is a great experience"

text3 = "I hate this product"

text4 = "I do not like this experience"

# Creating a CountVectorizer object and transforming the texts into feature vectors

vectorizer = CountVectorizer()

vectors = vectorizer.fit_transform([text1, text2, text3, text4])

# Calculating cosine similarity between text1 and text2

similarity = cosine_similarity(vectors[0], vectors[1])[0][0]

print(f"Cosine similarity between text1 and text2: {similarity}")

# Calculating cosine similarity between text1 and text3

similarity = cosine_similarity(vectors[0], vectors[2])[0][0]

print(f"Cosine similarity between text1 and text3: {similarity}")

Page 21
# Calculating cosine similarity between text2 and text4

similarity = cosine_similarity(vectors[1], vectors[3])[0][0]

print(f"Cosine similarity between text2 and text4: {similarity}")

OUTPUT:

Cosine similarity between text1 and text2: 0.0

Cosine similarity between text1 and text3: 0.5

Cosine similarity between text2 and text4: 0.0

Page 22

Nature Inspired Computing Notes 1
100% (1)
Nature Inspired Computing Notes 1
22 pages
R20 Iii-Ii ML Lab Manual
100% (1)
R20 Iii-Ii ML Lab Manual
79 pages
Web App Lab Manual R20 by Hemanth
80% (5)
Web App Lab Manual R20 by Hemanth
41 pages
Kavya
No ratings yet
Kavya
38 pages
Nodejs Lab Manual r22
No ratings yet
Nodejs Lab Manual r22
68 pages
Big - Data Lab Manual
No ratings yet
Big - Data Lab Manual
65 pages
Web and Social Media Analytics Lab
No ratings yet
Web and Social Media Analytics Lab
34 pages
DV Lab Manual
No ratings yet
DV Lab Manual
88 pages
cs8601 Mobile Computing Notes
No ratings yet
cs8601 Mobile Computing Notes
109 pages
NLP - (Natural Language Processing Lab Manual)
No ratings yet
NLP - (Natural Language Processing Lab Manual)
12 pages
First Periodical Test P.E & Health 12
64% (11)
First Periodical Test P.E & Health 12
4 pages
CCS369 - TSS-Unit 3
No ratings yet
CCS369 - TSS-Unit 3
55 pages
Compiler Design Two Marks
50% (2)
Compiler Design Two Marks
17 pages
Machine Learning Unit 5
No ratings yet
Machine Learning Unit 5
43 pages
Os Lab Manual
No ratings yet
Os Lab Manual
107 pages
Natural Language Processing
No ratings yet
Natural Language Processing
37 pages
Cs8661-Internet Programming Laboratory
100% (1)
Cs8661-Internet Programming Laboratory
65 pages
NLP Lab Manual
83% (6)
NLP Lab Manual
56 pages
21cs644 Module 3
100% (1)
21cs644 Module 3
95 pages
Unit5 BD
100% (2)
Unit5 BD
91 pages
Principles of E Phonetics and Phonology
100% (1)
Principles of E Phonetics and Phonology
74 pages
VTU Exam Question Paper With Solution of 18CS72 Big Data and Analytics Feb-2022-Dr. v. Vijayalakshmi
No ratings yet
VTU Exam Question Paper With Solution of 18CS72 Big Data and Analytics Feb-2022-Dr. v. Vijayalakshmi
25 pages
Analytical Learning
No ratings yet
Analytical Learning
42 pages
ML UNIT-5 Notes PDF
No ratings yet
ML UNIT-5 Notes PDF
41 pages
NLP Sem Questions and Answers
No ratings yet
NLP Sem Questions and Answers
72 pages
Deep Learning R18 Jntuh Lab Manual
0% (1)
Deep Learning R18 Jntuh Lab Manual
21 pages
BDA Lab Manual AI&DS
No ratings yet
BDA Lab Manual AI&DS
60 pages
Mini Project Report On KSRTC Ticket Rese
No ratings yet
Mini Project Report On KSRTC Ticket Rese
75 pages
21CS51 (Automata Theory and Compiler Design)
0% (1)
21CS51 (Automata Theory and Compiler Design)
4 pages
WSMA Lab Manual 2
No ratings yet
WSMA Lab Manual 2
8 pages
Daftar Lengkap Regular Verb Beserta Artinya
No ratings yet
Daftar Lengkap Regular Verb Beserta Artinya
38 pages
Co-Po Big Data Analytics
100% (1)
Co-Po Big Data Analytics
41 pages
NLP Lab Manual 3-2 Aiml R22 Update
100% (1)
NLP Lab Manual 3-2 Aiml R22 Update
20 pages
Birth of Indian Cinema
100% (1)
Birth of Indian Cinema
26 pages
1-NLP - Lab Manual
No ratings yet
1-NLP - Lab Manual
15 pages
Big Data Analytics Lab Manual
No ratings yet
Big Data Analytics Lab Manual
38 pages
Machine Learning Question Paper Solved ML
No ratings yet
Machine Learning Question Paper Solved ML
55 pages
Dbms Lab Manual - II B.tech It Semii (2017-18)
No ratings yet
Dbms Lab Manual - II B.tech It Semii (2017-18)
83 pages
NLP - Exp 1 11
No ratings yet
NLP - Exp 1 11
29 pages
NLP Notes For Students
No ratings yet
NLP Notes For Students
18 pages
Unit 4 NLP
No ratings yet
Unit 4 NLP
51 pages
Kỳ Anh Kỳ Thi Chọn Học Sinh Giỏi Huyện LỚP 8 THCS NĂM HỌC 2020 - 2021 Môn thi: TIẾNG ANH
67% (3)
Kỳ Anh Kỳ Thi Chọn Học Sinh Giỏi Huyện LỚP 8 THCS NĂM HỌC 2020 - 2021 Môn thi: TIẾNG ANH
9 pages
NLP UNIT 2 (Ques Ans Bank)
No ratings yet
NLP UNIT 2 (Ques Ans Bank)
26 pages
NLP Assignment-1 Solution
No ratings yet
NLP Assignment-1 Solution
4 pages
Vtu NLP Questions
100% (1)
Vtu NLP Questions
5 pages
Machine Learning UNIT 1 PDF
100% (1)
Machine Learning UNIT 1 PDF
33 pages
Soal UTS Bahasa Inggris Kls 4
100% (3)
Soal UTS Bahasa Inggris Kls 4
3 pages
UI-UX Desig Lab Manual
100% (1)
UI-UX Desig Lab Manual
3 pages
NLP ORAL - Sample Question Bank: Modul e No. Sr. No - Description
No ratings yet
NLP ORAL - Sample Question Bank: Modul e No. Sr. No - Description
9 pages
Question Bank For Ai
0% (1)
Question Bank For Ai
2 pages
r18 - Big Data Analytics - Cse (DS)
0% (1)
r18 - Big Data Analytics - Cse (DS)
1 page
Compiler Design Model Lab Questions
No ratings yet
Compiler Design Model Lab Questions
4 pages
NLP QB
100% (2)
NLP QB
14 pages
Experiment: 1
No ratings yet
Experiment: 1
28 pages
Web Lab Question Bank
100% (1)
Web Lab Question Bank
2 pages
PAT Trees and PAT Arrays
No ratings yet
PAT Trees and PAT Arrays
12 pages
1) Explain in Detail Core Function of Edge Analytics With Diagram
No ratings yet
1) Explain in Detail Core Function of Edge Analytics With Diagram
13 pages
FLAT - UNIT 1 Notes
100% (2)
FLAT - UNIT 1 Notes
18 pages
CS8711 - Cloud Computing Laboratory Record: Department of Computer Science & Engineering
No ratings yet
CS8711 - Cloud Computing Laboratory Record: Department of Computer Science & Engineering
5 pages
101905CS502H - Neural Networks and Deep Learning - Model Question Paper
100% (1)
101905CS502H - Neural Networks and Deep Learning - Model Question Paper
4 pages
NLP Exp-123
No ratings yet
NLP Exp-123
6 pages
Data Analytics Unit-I
No ratings yet
Data Analytics Unit-I
25 pages
Atcd Model QP
0% (1)
Atcd Model QP
4 pages
Recursively Enumerable Languages
No ratings yet
Recursively Enumerable Languages
8 pages
NLP Record
No ratings yet
NLP Record
15 pages
Sonata Software Sample Aptitude Placement Paper Level1
No ratings yet
Sonata Software Sample Aptitude Placement Paper Level1
7 pages
Course End Survey DS - B SEC
No ratings yet
Course End Survey DS - B SEC
3 pages
DVT - Question Bank
100% (1)
DVT - Question Bank
3 pages
Artificial Intelligence Question Bank
100% (1)
Artificial Intelligence Question Bank
3 pages
Titik Nol English Course: Noun Clause
No ratings yet
Titik Nol English Course: Noun Clause
1 page
Critical Discourse Analysis of Sinhala Cultural Identity in Celebrity Endorsement in Newspaper Advertisements
No ratings yet
Critical Discourse Analysis of Sinhala Cultural Identity in Celebrity Endorsement in Newspaper Advertisements
16 pages
Enemy Pie: by Derek Munson Comprehension Questions and Projects For The Book
No ratings yet
Enemy Pie: by Derek Munson Comprehension Questions and Projects For The Book
10 pages
Certainly Uncertain Derrida
No ratings yet
Certainly Uncertain Derrida
34 pages
2024 Bi Ting 4 Kertas 4 Kertas Soalan
No ratings yet
2024 Bi Ting 4 Kertas 4 Kertas Soalan
6 pages
Cooperative Learning and Achievement in English Language Acquisition in A Literature Class in A Secondary School
No ratings yet
Cooperative Learning and Achievement in English Language Acquisition in A Literature Class in A Secondary School
139 pages
Bahasa Inggris BS KLS VIII 5
No ratings yet
Bahasa Inggris BS KLS VIII 5
65 pages
LearnEnglishKids Writing Practice Level 1 My Family
No ratings yet
LearnEnglishKids Writing Practice Level 1 My Family
4 pages
Grammar Tenses Master 2 Law
No ratings yet
Grammar Tenses Master 2 Law
3 pages
Edexcel Lit Poetry Tute 12 - The Tyger
No ratings yet
Edexcel Lit Poetry Tute 12 - The Tyger
21 pages
Cardinal Number: Nine People Injured Have Been Taken To The Hospital
No ratings yet
Cardinal Number: Nine People Injured Have Been Taken To The Hospital
3 pages
Internet Banking Java Project Report 5 PDF Free
No ratings yet
Internet Banking Java Project Report 5 PDF Free
68 pages
Text PDF
No ratings yet
Text PDF
79 pages
Cloud Computing and Distributed Systems
No ratings yet
Cloud Computing and Distributed Systems
1 page
82001
No ratings yet
82001
1 page
GM7 Hè - Day 15+ Mock Test
No ratings yet
GM7 Hè - Day 15+ Mock Test
4 pages
Malamoni Goswami
No ratings yet
Malamoni Goswami
16 pages
CD Lab File
No ratings yet
CD Lab File
28 pages
Analysis On The Techniques and Quality of News Translation On BBC Indonesia Website
No ratings yet
Analysis On The Techniques and Quality of News Translation On BBC Indonesia Website
17 pages
Grade 9 Reviewer 3rd Grading
No ratings yet
Grade 9 Reviewer 3rd Grading
32 pages
Satish Mahajan
No ratings yet
Satish Mahajan
3 pages
Genesee Second Lenguage
No ratings yet
Genesee Second Lenguage
13 pages
Phrasal Verbs Test I. Choose The Correct Verb
No ratings yet
Phrasal Verbs Test I. Choose The Correct Verb
1 page
Megan Rose Resume
No ratings yet
Megan Rose Resume
2 pages
Hyperbole Lesson Plan
No ratings yet
Hyperbole Lesson Plan
4 pages
Chapter 2 WHY TEACH GRAMMAR
No ratings yet
Chapter 2 WHY TEACH GRAMMAR
4 pages

NLP Lab Manual (R20)

Uploaded by

NLP Lab Manual (R20)

Uploaded by

NATURAL LANGUAGE PROCESSING WITH PYTHON LAB ROLL NO:

Roll No. ……………………. Of..……. Year B.tech ......................... Semester in

………………………………….Laboratory during the academic year 2022-2023.

Staff Member In charge Head of the Department

Submitted for the practical examination held on …………………

KRISHNACHAITANYA INSTITUTE OF TECHNOLOGY & SCIENCES::MARKAPUR

2 Perform lemmatization and stemming using python library nltk.

3 Demonstrate object standardization such as replace social media

4 Perform part of speech tagging on any textual data.

5 Implement topic modeling using Latent Dirichlet Allocation (LDA )

6 Demonstrate Term Frequency – Inverse Document Frequency

7 Demonstrate word embedding’s using word2vec.

8 Implement Text classification using naïve bayes classifier and

9 Apply support vector machine for text classification

10 Convert text to vectors (using term frequency) and apply cosine

text = re.sub(r'http\S+', '', text)

text = re.sub(r'@\S+', '', text)

text = re.sub(r'#\S+', '', text)

# Remove punctuation and other non-alphanumeric characters

text = re.sub(r'[^\w\s]', '', text)

# Remove extra whitespace

text = re.sub(r'\s+', ' ', text).strip()

Just had the best coffee from!

fromnltk.stem import PorterStemmer

Stemming a Complete Sentence

fromnltk.tokenize import sent_tokenize, word_tokenize

lookup_dict = {'rt':'Retweet', 'dm':'direct message', "awsm" : "awesome",

"luv" :"love","hlo":"hello","<3":"â™¡","aa":"allu arjun","ths":"this",

"gn":"goodnight","gm":"good morning","ga":"good afternoon",

for word in words:

new_text = " ".join(new_words)

print("THE CONVERTED MESSAGE IS AS FOLLOWS >>:")

<3 <3 <3

" hlo everyone!!

I had got so much luv from you all!!

tq for all your awsm luv and affection

i am soo happy to get great luv from u all ,

THE CONVERTED MESSAGE IS AS FOLLOWS >>:

# sent_tokenize is one of instances of

# Word tokenizers is used to find the words

# removing stop words from wordList

from gensim import corpora

doc1 = "hello world"

doc2 = "world news"

doc3 = "news update"

doc4 = "world update"

doc5 = "hello update"

documents = [doc1, doc2, doc3, doc4, doc5]

stopwords = set('for a of the and to in'.split())

# Creating dictionary and corpus

corpus = [dictionary.doc2bow(text) for text in texts]

# LDA model training

lda_model = gensim.models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,

for topic_id, topic in lda_model.show_topics(num_topics=2, num_words=3):

print(f"Topic {topic_id+1}: {topic}")

Topic 1: 0.263*"world" + 0.256*"hello" + 0.252*"update"

Topic 2: 0.387*"news" + 0.296*"update" + 0.196*"world"

from sklearn.feature_extraction.text import TfidfVectorizer

doc1 = "hello world"

doc2 = "world news"

doc3 = "news update"

doc4 = "world update"

doc5 = "hello update"

documents = [doc1, doc2, doc3, doc4, doc5]

# Creating TfidfVectorizer object

# Fitting and transforming the corpus

# Getting feature names and IDF scores

for i, doc in enumerate(documents):

print(f"Document {i+1}: {doc}")

for j, word in enumerate(feature_names):

print(f" {word}: TF-IDF={tfidf_score:.3f}, IDF={idf_score:.3f}, TF={tf_score:.3f}")

Document 1: hello world

hello: TF-IDF=0.707, IDF=1.099, TF=0.643

news: TF-IDF=0.000, IDF=1.099, TF=0.000

update: TF-IDF=0.000, IDF=1.099, TF=0.000

world: TF-IDF=0.707, IDF=1.099, TF=0.643

Topic 1: 0.263"world" + 0.256"hello" + 0.252*"update"

Topic 2: 0.387"news" + 0.296"update" + 0.196*"world"