NLP Tushar

This document covers preparing bag-of-words and TF-IDF models in Python. It shows code to: 1) clean text data by removing punctuation and lowercasing words; 2) build a vocabulary of the unique words across multiple documents; 3) compute bag-of-words representations by counting word frequencies in each document; 4) vectorize text into feature matrices with scikit-learn's CountVectorizer; 5) compute TF-IDF weights with scikit-learn's TfidfVectorizer.


11. Write a Python program to prepare a Bag of Words model.

Code :-

import pandas as pd
import numpy as np
import re

doc1 = "Game of Thrones is an amazing tv series"
doc2 = "Game of Thrones is best tv series"
doc3 = "Game of Thrones is so great"

# Remove punctuation, lowercase, and split into tokens
l_doc1 = re.sub(r"[^a-zA-Z0-9]", " ", doc1.lower()).split()
l_doc2 = re.sub(r"[^a-zA-Z0-9]", " ", doc2.lower()).split()
l_doc3 = re.sub(r"[^a-zA-Z0-9]", " ", doc3.lower()).split()

# Build the vocabulary (word set): the unique words found across the three documents
wordset12 = np.union1d(l_doc1, l_doc2)
wordset = np.union1d(wordset12, l_doc3)
print(wordset)

def calculateBOW(wordset, l_doc):
    # Start every vocabulary word at zero, then count its occurrences in the document
    tf_diz = dict.fromkeys(wordset, 0)
    for word in l_doc:
        tf_diz[word] = l_doc.count(word)
    return tf_diz

# We can finally obtain the bag-of-words representations of the documents. In the
# end we get a dataframe where each row holds the extracted features of one document.
bow1 = calculateBOW(wordset, l_doc1)
bow2 = calculateBOW(wordset, l_doc2)
bow3 = calculateBOW(wordset, l_doc3)

df_bow = pd.DataFrame([bow1, bow2, bow3])
df_bow.head()
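For these three sentences the result is deterministic, so it can be checked by hand; df_bow should contain:

#    amazing  an  best  game  great  is  of  series  so  thrones  tv
# 0        1   1     0     1      0   1   1       1   0        1   1
# 1        0   0     1     1      0   1   1       1   0        1   1
# 2        0   0     0     1      1   1   1       0   1        1   0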

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
x = vectorizer.fit_transform([doc1, doc2, doc3])
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out())

df_bow_sklearn = pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names_out())
df_bow_sklearn.head()

# Repeat with English stop words removed
vectorizer = CountVectorizer(stop_words="english")
x = vectorizer.fit_transform([doc1, doc2, doc3])
print(vectorizer.get_feature_names_out())

df_bow_sklearn = pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names_out())
df_bow_sklearn.head()

# Bigram features: ngram_range=(2, 2) with stop words removed
vectorizer = CountVectorizer(stop_words="english", ngram_range=(2, 2))
x = vectorizer.fit_transform([doc1, doc2, doc3])
print(vectorizer.get_feature_names_out())

df_bow_sklearn = pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names_out())
df_bow_sklearn.head()
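Note that stop words are removed before the n-grams are built. For example, doc3 ("Game of Thrones is so great") reduces to ['game', 'thrones', 'great'] once "is" and "so" are dropped, so it contributes the bigrams 'game thrones' and 'thrones great' rather than any bigram containing a stop word.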

# Load an SMS dataset for a small spam-classification example
import pandas as pd
dataset = pd.read_csv(r"c:\Users\HP\data.csv", encoding="ISO-8859-1")
dataset.head()

import re
import nltk
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize as wt  # word_tokenize, used as wt() below

stemmer = PorterStemmer()

data = []
for i in range(dataset.shape[0]):
    sms = dataset.iloc[i, 1]

    # Remove non-alphabetic characters
    sms = re.sub('[^A-Za-z]', ' ', sms)

    # Lowercase, so that "Go" and "go" are not counted as two different words
    sms = sms.lower()

    # Tokenising
    tokenized_sms = wt(sms)

    # Remove stop words and apply stemming
    sms_processed = []
    for word in tokenized_sms:
        if word not in set(stopwords.words('english')):
            sms_processed.append(stemmer.stem(word))

    sms_text = " ".join(sms_processed)
    data.append(sms_text)

# Creating the feature matrix
from sklearn.feature_extraction.text import CountVectorizer
matrix = CountVectorizer(max_features=1000)
x = matrix.fit_transform(data).toarray()
y = dataset.iloc[:, 0]

# Split train and test data
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y)

# Naive Bayes
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(x_train, y_train)

# Predict class
y_pred = classifier.predict(x_test)

# Confusion matrix and evaluation metrics
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
cm = confusion_matrix(y_test, y_pred)
cr = classification_report(y_test, y_pred)

accuracy = accuracy_score(y_test, y_pred)
accuracy
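GaussianNB assumes continuous, normally distributed features; for word-count features MultinomialNB is usually the better fit. A minimal swap, keeping the rest of the pipeline unchanged:

from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()  # multinomial model suits discrete count features
classifier.fit(x_train, y_train)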

12. Write a Python program to prepare a TF-IDF model.
Code :-

from sklearn.feature_extraction.text import TfidfVectorizer

d0 = "The car is driven on the road"
d1 = "The truck is driven on the highway"
d2 = "The bike is run on road"

string = [d0, d1, d2]

tfidf = TfidfVectorizer()
result = tfidf.fit_transform(string)
result

print("\nword indices : ")


print(tfidf.vocabulary_)

print("\nidf values : ")
for ele1,ele2 in zip(tfidf.get_feature_names(),tfidf.idf_) :
print(ele1,":",ele2)
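With TfidfVectorizer's defaults (smooth_idf=True), idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing t. A quick sanity check for one term:

import math
n, df_car = 3, 1  # three documents; "car" occurs only in d0
print(math.log((1 + n) / (1 + df_car)) + 1)  # ~1.6931, should match tfidf.idf_ for "car"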

print("\ntf.idf values : ")


print(result)

print("\ntf.idf values i matrix form : ")
print(result.toarray())

13. Write a Python program to prepare a CountVectorizer model.
Code :-
from sklearn.feature_extraction.text import CountVectorizer

# To create a count vectorizer, we simply need to instantiate one.
# There are special parameters we can set here when making the vectorizer,
# but for the most basic example they are not needed.
vectorizer = CountVectorizer()

# For our text, we take a passage from a previous blog post about count vectorization
sample_text = ["One of the most basic ways we can numerically represent words "
               "is through the one-hot encoding method (also sometimes called "
               "count vectorizing)."]

# To actually create the vectorizer, we simply need to call fit on the text
# data that we wish to fit
vectorizer.fit(sample_text)

# Now we can inspect how our vectorizer vectorized the text.
# This will print out the list of words used and their index in the vectors.
print("Vocabulary : ")
print(vectorizer.vocabulary_)

# If we would like to actually create a vector, we can do so by passing the
# text into the vectorizer to get back counts
vector = vectorizer.transform(sample_text)

# Our final vector:
print("Full vector : ")
print(vector.toarray())

# Or, if we wanted to get the vector for one word:
print("Hot vector : ")
print(vectorizer.transform(['hot']).toarray())

# Or, if we wanted to get multiple vectors at once to build matrices:
print("Hot and One : ")
print(vectorizer.transform(['hot', 'one']).toarray())

# We could also do the whole thing at once with the fit_transform method:
print('One swoop : ')
new_text = ["Today is the day that I do the thing today, today"]
new_vectorizer = CountVectorizer()
print(new_vectorizer.fit_transform(new_text).toarray())
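The default token pattern keeps only tokens of two or more word characters, so "I" is dropped; the learned vocabulary is ['day', 'do', 'is', 'that', 'the', 'thing', 'today'] and the printed counts should be:

# [[1 1 1 1 2 1 3]]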

14. Write a Python program to perform text classification with NLTK using a Naive Bayes classifier.
Code :-
import numpy as np
import pandas as pd

df = pd.read_csv("C:\\Users\\Admin\\Downloads\\BBC_News_Train.csv")
df.head()

df.shape

df['Category'].value_counts()

import nltk
from nltk.corpus import stopwords
import string

def text_cleaning(a):
    # Drop punctuation characters, then remove English stop words
    remove_punctuation = [char for char in a if char not in string.punctuation]
    remove_punctuation = ''.join(remove_punctuation)
    return [word for word in remove_punctuation.split()
            if word.lower() not in stopwords.words('english')]

print(df.iloc[:, 1].apply(text_cleaning))
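A quick check on a made-up sentence shows what the cleaner keeps:

print(text_cleaning("This is a sample sentence, with punctuation!"))
# ['sample', 'sentence', 'punctuation']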

from sklearn.feature_extraction.text import CountVectorizer

bow_transformer = CountVectorizer(analyzer=text_cleaning).fit(df['Text'])
bow_transformer.vocabulary_

title_bow = bow_transformer.transform(df['Text'])
print(title_bow)

x = title_bow.toarray()
print(x)
x.shape

from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer().fit(title_bow)
print(tfidf_transformer)

title_tfidf = tfidf_transformer.transform(title_bow)
print(title_tfidf)
print(title_tfidf.shape)

from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB().fit(title_tfidf,df['Category'])

all_predictions = model.predict(title_tfidf)
print(all_predictions)

from sklearn.metrics import confusion_matrix
confusion_matrix(df['Category'], all_predictions)

from sklearn.metrics import classification_report
print(classification_report(df['Category'], all_predictions))
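Note that these predictions are made on the same data the model was trained on, so the reported scores are optimistic. A minimal sketch of a fairer evaluation on a held-out split, reusing the variables defined above:

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hold out 20% of the documents for testing
x_tr, x_te, y_tr, y_te = train_test_split(title_tfidf, df['Category'],
                                          test_size=0.2, random_state=42)
held_out_model = MultinomialNB().fit(x_tr, y_tr)
print(accuracy_score(y_te, held_out_model.predict(x_te)))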

15. Write a Python program to convert words to features with NLTK.
Code :-
import nltk
nltk.download('movie_reviews')
import random
from nltk.corpus import movie_reviews

# Pair each review's word list with its label ("pos" or "neg")
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

random.shuffle(documents)

all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)

# Take the first 3000 distinct words as the feature vocabulary
word_features = list(all_words.keys())[:3000]

def find_features(document):
    # Mark, for every vocabulary word, whether it appears in this document
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)
    return features

print(find_features(movie_reviews.words('neg/cv000_29416.txt')))

featuresets = [(find_features(rev), category) for (rev, category) in documents]
featuresets
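With the feature sets built, a natural next step is to train and evaluate NLTK's own Naive Bayes classifier; a minimal sketch:

# movie_reviews has 2000 documents; 1900/100 is a common train/test split
training_set = featuresets[:1900]
testing_set = featuresets[1900:]

classifier = nltk.NaiveBayesClassifier.train(training_set)
print("Accuracy:", nltk.classify.accuracy(classifier, testing_set))
classifier.show_most_informative_features(15)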
