0% found this document useful (0 votes)
19 views

Amazon Assignment Ex

The document discusses performing sentiment analysis on Amazon fine food reviews. It describes the dataset containing over 500,000 reviews with 10 features each. It then performs preprocessing steps like filtering reviews with a score of 3, partitioning scores above 3 as positive and below 3 as negative, dropping duplicate entries, and transforming the text into count vectors for analysis.

Uploaded by

Ram R
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
19 views

Amazon Assignment Ex

The document discusses performing sentiment analysis on Amazon fine food reviews. It describes the dataset containing over 500,000 reviews with 10 features each. It then performs preprocessing steps like filtering reviews with a score of 3, partitioning scores above 3 as positive and below 3 as negative, dropping duplicate entries, and transforming the text into count vectors for analysis.

Uploaded by

Ram R
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 11

Sentiment Classification on Amazon Fine Food Reviews

Objective : Objective of this exercise is to perform the sentiment analysis on Amazon Fine Food Reviews.

Data Analysis : Amazon Fine Food Review is a large data set with around 568K reviews given by the customers on the Amazon food
products. Each review has got the following 10 features.

1. ID
2. ProductId - Unique Identifier on the product.
3. UserId - Unique Identifier for the user.
4. ProfileName
5. HelpfulnessNumerator - Number of users who found the review helpful.
6. HelpfulnessDenominator - Number of users who indicated whether the review they found helpful.
7. Score - Rating between 1 to 5.
8. Time - Timestamp for the review.
9. Summary - Brief summary of the review.
10. Text - Text of the review.

In [4]:
import os
os.getcwd()
os.chdir(r"C:\Users\Sujatha\Applied AI Course\Ramesh\Data")
os.getcwd()

Out[4]:
'C:\\Users\\Sujatha\\Applied AI Course\\Ramesh\\Data'

In [2]:
%matplotlib inline

import sqlite3
import pandas as pd
import numpy as np
import nltk
import string
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.feature_extraction.text import CountVectorizer


from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from nltk.stem.porter import PorterStemmer

In [3]:

#Reading the SQL data using SQLite

con = sqlite3.connect('./amazon-fine-food-reviews/database.sqlite')

filtered_data = pd.read_sql_query(""" SELECT * FROM Reviews where Score != 3 """, con)


filtered_data = filtered_data[0:3500]

In [4]:

filtered_data.head()

Out[4]:

Id ProductId UserId ProfileName HelpfulnessNumerator HelpfulnessDenominator Score Time Summary


Id ProductId UserId ProfileName HelpfulnessNumerator HelpfulnessDenominator Score Time Summary
Id ProductId UserId ProfileName HelpfulnessNumerator HelpfulnessDenominator Score Time Summary

Good
0 1 B001E4KFG0 A3SGXH7AUHU8GW delmartian 1 1 5 1303862400 Quality
Dog Food

Not as
1 2 B00813GRG4 A1D87F6ZCVE5NK dll pa 0 0 1 1346976000
Advertised

Natalia
Corres "Delight"
2 3 B000LQOCH0 ABXLMWJIXXAIN 1 1 4 1219017600
"Natalia says it all
Corres"

Cough
3 4 B000UA0QIQ A395BORC6FGVXV Karl 3 3 2 1307923200
Medicine

Michael D.
4 5 B006K2ZZ7K A1UQRSCLF8GW1T Bigham "M. 0 0 5 1350777600 Great taffy
Wassir"

I limited my data to be 3500 reviews only as my machine configuration is small and is not supporting for large data.

In [5]:
filtered_data = filtered_data[0:3500]
filtered_data.shape

Out[5]:
(3500, 10)

The Score above 3 is taken as Positive and below 3 is taken as negative and equal to 3 is neglected.

In [6]:

def partition(y):
if y > 3:
return "postive"
return "negative"

In [7]:
actualscore = filtered_data["Score"]
positivenegative = actualscore.map(partition)
filtered_data['Score'] = positivenegative

In [8]:
filtered_data.shape

Out[8]:
(3500, 10)

In [9]:
filtered_data.head()

Out[9]:

Id ProductId UserId ProfileName HelpfulnessNumerator HelpfulnessDenominator Score Time

Good
0 1 B001E4KFG0 A3SGXH7AUHU8GW delmartian 1 1 postive 1303862400 Quality
Dog Foo

Not as
1 2 B00813GRG4 A1D87F6ZCVE5NK dll pa 0 0 negative 1346976000
Advertis

Natalia
Corres "Delight"
2 3 B000LQOCH0 ABXLMWJIXXAIN 1 1 postive 1219017600
"Natalia says it a
Corres"

Cough
3 4 B000UA0QIQ A395BORC6FGVXV Karl 3 3 negative 1307923200
Medicine

Michael D.
4 5 B006K2ZZ7K A1UQRSCLF8GW1T Bigham "M. 0 0 postive 1350777600 Great ta
Wassir"

In [10]:
display = pd.read_sql_query("""SELECT * FROM Reviews where score!=3 and UserId like '%AR5J8U%' ord
er by ProductID""",con)

In [11]:
display.head()

Out[11]:

Id ProductId UserId ProfileName HelpfulnessNumerator HelpfulnessDenominator Score Time

Geetha
0 78445 B000HDL1RQ AR5J8UI46CURR 2 2 5 1199577600
Krishnan

Geetha
1 138317 B000HDOPYC AR5J8UI46CURR 2 2 5 1199577600
Krishnan
Krishnan
Id ProductId UserId ProfileName HelpfulnessNumerator HelpfulnessDenominator Score Time

Geetha
2 138277 B000HDOPYM AR5J8UI46CURR 2 2 5 1199577600
Krishnan

Geetha
3 73791 B000HDOPZG AR5J8UI46CURR 2 2 5 1199577600
Krishnan

Geetha
4 155049 B000PAQ75C AR5J8UI46CURR 2 2 5 1199577600
Krishnan

In [12]:
sorted_data = filtered_data.sort_values("ProductId",axis = 0,ascending = True)

In [13]:
#Duplication of entries.
final = sorted_data.drop_duplicates(subset = {"UserId","ProfileName","Time",'Text'}, keep = "first"
,inplace = False)

In [14]:
final.shape

Out[14]:
(3491, 10)

In [15]:

(final['Id'].size*1.0)/(filtered_data['Id'].size*1.0)

Out[15]:
0.9974285714285714

In [16]:

final = final[final.HelpfulnessNumerator <= final.HelpfulnessDenominator]


final.shape

Out[16]:
(3491, 10)

In [17]:
final['Score'].value_counts()

Out[17]:
postive 2909
negative 582
Name: Score, dtype: int64

In [18]:
count_vect = CountVectorizer()
final_counts = count_vect.fit_transform(final['Text'].values)

In [19]:
final_counts.shape

Out[19]:
(3491, 11010)

In [20]:
import re
i=0
for sent in final['Text'].values:
if(len(re.findall("<.*?",sent))):
print(i)
print(sent)
break;
i+=1

0
Why is this $[...] when the same product is available for $[...] here?<br
/>http://www.amazon.com/VICTOR-FLY-MAGNET-BAIT-REFILL/dp/B00004RBDY<br /><br />The Victor M380 and
M502 traps are unreal, of course -- total fly genocide. Pretty stinky, but only right nearby.

In [21]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to


[nltk_data] C:\Users\Sujatha\AppData\Roaming\nltk_data...
[nltk_data] Package stopwords is already up-to-date!

Out[21]:
True

In [22]:

import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

stop = set(stopwords.words('english')) #set of stopwords


sno = nltk.stem.SnowballStemmer('english') #initialising the snowball stemmer

print(stop)

{'their', 'own', 'just', "wasn't", 'hadn', 'has', 'each', 'i', 'by', "haven't", 'again',
'yourselves', "hasn't", "she's", 's', 'why', "should've", 'will', 'ma', 'into', 'him', 'its', 'are
', 'off', 'd', 'his', 'where', 'mustn', 'didn', 'an', 'both', "that'll", 'doesn', 'such', 'were',
'with', 'ain', "aren't", 'do', 'himself', 'being', 'is', 'any', 'our', 'was', 'on', 'needn', 'for'
, 'which', 'these', 'few', 'between', "wouldn't", 'ours', 'all', 'they', 'or', 'very', 'if',
'now', 'doing', 'other', 'after', 'won', "weren't", "you've", 'ourselves', 'have', 'yours', 'then'
, 'whom', "mightn't", 'he', 'hers', "hadn't", 'what', 'she', 'those', 'yourself', 'in', 'but', 'as
', 'above', 'while', 'below', "you'd", 'from', 'most', 'up', 'me', 'when', 'themselves', 'wasn', "
didn't", 'myself', 'be', 'her', 't', 'because', 'itself', 'o', 'your', 're', 'haven', 'how',
'not', 'shan', 'this', 'can', 'same', 'm', 'theirs', 'been', 'more', 'some', 'shouldn', 'my',
'during', "shouldn't", 'so', "couldn't", 'we', 'mightn', 'hasn', 'through', 'wouldn', 'until', "yo
u're", "you'll", "it's", 'out', 'should', 'that', 'only', 'weren', 'a', 'once', "mustn't", 'aren',
'having', "needn't", 'it', 'herself', 've', "shan't", 'before', 'who', "won't", "isn't", 'and', 't
he', 'to', 'nor', 'than', 'against', 'under', 'over', 'had', 'y', 'them', 'll', 'at', 'about', 'th
ere', 'don', 'am', 'you', "doesn't", 'further', 'no', 'down', 'too', 'isn', 'here', 'couldn',
'of', 'did', "don't", 'does'}

Cleaning the data like special characters,html tags etc..


Cleaning the data like special characters,html tags etc..

In [23]:
def cleanhtml(sentence):
cleanr = re.compile('<.*?')
cleantext = re.sub(cleanr, ' ', sentence)
return cleantext

def cleanpunc(sentence):
cleaned = re.sub(r'[?|!|\|"|#]',' ',sentence)
cleaned = re.sub(r'[.|,|)|(|\|/]',' ',cleaned)
return cleaned

print(sno.stem('tasty'))

tasti

In [24]:
i=0
strl = ' '
final_string = []
all_positive_words = [ ]
all_negative_words = [ ]
s = ' '
for sent in final['Text'].values:
filtered_sentence = []
sent = cleanhtml(sent)
for w in sent.split():
for cleaned_words in cleanpunc(w).split():
if(((cleaned_words.isalpha()& (len(cleaned_words)>2)))):
if(cleaned_words.lower() not in stop):
s = (sno.stem((cleaned_words.lower()))).encode('utf8')
filtered_sentence.append(s)
if(final['Score'].values)[i]== 'positive':
all_positive_words.append(s)
if(final['Score'].values)[i]== 'negative':
all_negative_words.append(s)
else:
continue
else:
continue
strl = b" ".join(filtered_sentence)
final_string.append(strl)
i+=1

In [25]:
final['CleanedText'] = final_string

final.head()

Out[25]:

Id ProductId UserId ProfileName HelpfulnessNumerator HelpfulnessDenominator Score Time

2546 2774 B00002NCJC A196AJHU9EASJN Alex Chaffee 0 0 postive 1282953600

2547 2775 B00002NCJC A13RRPGE79XFFH reader48 0 0 postive 1281052800


Id B00002Z754
1145 1244 ProductId A3B8RCEI0FXFI6
UserId ProfileName
B G Chase HelpfulnessNumerator
10 HelpfulnessDenominator
10 Score 962236800
postive Time

1146 1245 B00002Z754 A29Z5PI9BW2PU3 Robbie 7 7 postive 961718400

Glenna E.
Bauer
2941 3203 B000084DVR A3DKGXWUEP1AI2 3 3 postive 1163030400
"Puppy
Mum"

In [26]:

con = sqlite3.connect('final_sqlite')
c = con.cursor()
con.text_factory = str
final.to_sql('Reviews',con, if_exists='replace', index=True, index_label=None, chunksize=None,
dtype=None)

Bigram
In [27]:

#Bigrams
count_vect = CountVectorizer(ngram_range = (1,2))
final_bigram_counts = count_vect.fit_transform(final['Text'].values)
final_bigram_counts.get_shape()

Out[27]:
(3491, 111023)

In [30]:
#t-SNE for Bigrams data.
from sklearn.manifold import TSNE
import seaborn as sn

model = TSNE(n_components = 2, random_state = 0)


data_3500 = final_bigram_counts[0:4500,:]
data_3500 = data_3500.toarray()
lbl1 = final['Score']
lbl1 = lbl1[0:4500]
tsne_data =model.fit_transform(data_3500)
tsne_data = np.vstack((tsne_data.T, lbl1)).T

tsne_df = pd.DataFrame(data=tsne_data, columns=("Dim_1", "Dim_2", "lbl1"))


sn.FacetGrid(tsne_df, hue="lbl1", size=6).map(plt.scatter, 'Dim_1', 'Dim_2').add_legend()
plt.show()
TF IDF
In [31]:

#TF IDF

tf_idf_vect = TfidfVectorizer(ngram_range = (1,2))


final_tf_idf = tf_idf_vect.fit_transform(final['Text'])
final_tf_idf.shape

Out[31]:
(3491, 111023)

In [32]:

#t-SNE for TF IDF data.


from sklearn.manifold import TSNE
import seaborn as sn
model = TSNE(n_components = 2, random_state = 0)
data_3500 = final_tf_idf[0:3000,:]
data_3500 = data_3500.toarray()
lbl1 = final['Score']
lbl1 = lbl1[0:3000]
tsne_data =model.fit_transform(data_3500)
tsne_data = np.vstack((tsne_data.T, lbl1)).T
#print(lbl1.shape)
#print(data_3500.shape)
#print(tsne_data.shape)
tsne_df = pd.DataFrame(data=tsne_data, columns=("Dim_1", "Dim_2", "lbl1"))
sn.FacetGrid(tsne_df, hue="lbl1", size=6).map(plt.scatter, 'Dim_1', 'Dim_2').add_legend()
plt.show()
In [33]:

import gensim
i= 0
list_of_sent = []
for sent in final['Text']:
filtered_sentence = []
sent = cleanhtml(sent)
for w in sent.split():
for cleaned_words in cleanpunc(w).split():
if(cleaned_words.isalpha()):
filtered_sentence.append(cleaned_words.lower())
else:
continue
list_of_sent.append(filtered_sentence)
print(final['Text'].values[0])
print(list_of_sent[0])

C:\Users\Sujatha\Anaconda3\lib\site-packages\gensim\utils.py:1197: UserWarning: detected Windows;


aliasing chunkize to chunkize_serial
warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")

Why is this $[...] when the same product is available for $[...] here?<br
/>http://www.amazon.com/VICTOR-FLY-MAGNET-BAIT-REFILL/dp/B00004RBDY<br /><br />The Victor M380 and
M502 traps are unreal, of course -- total fly genocide. Pretty stinky, but only right nearby.
['why', 'is', 'this', 'when', 'the', 'same', 'product', 'is', 'available', 'for', 'here', 'br',
'www', 'amazon', 'com', 'dp', 'br', 'br', 'victor', 'and', 'traps', 'are', 'unreal', 'of',
'course', 'total', 'fly', 'genocide', 'pretty', 'stinky', 'but', 'only', 'right', 'nearby']

Word2Vec
In [34]:

from gensim.models import Word2Vec


w2v_model = gensim.models.Word2Vec(list_of_sent,min_count = 5,size = 30,workers =4)
w2v_model

Out[34]:
<gensim.models.word2vec.Word2Vec at 0x1e5e6dd5b38>

In [35]:

words=list(w2v_model.wv.vocab)
print(len(words))

10307

In [36]:
from sklearn.manifold import TSNE
import seaborn as sn

#words1 = gensim.W2VTransformer(final['Text'])
model = TSNE(n_components = 2, random_state = 0)

count_vect = CountVectorizer()
w2v = count_vect.fit_transform(words)
#w2v.get_shape()
#data_3500 = w2v[0:50,:]
#data_3500 = data_3500.toarray()

w2v.shape

Out[36]:
(10307, 10281)

In [42]:
In [42]:
#t-SNE for Word2Vec data.
from sklearn.manifold import TSNE
import seaborn as sn

model = TSNE(n_components = 2, random_state = 0)


data_3500 = w2v[0:3200,:]
data_3500 = data_3500.toarray()

lbl1 = final['Score']
lbl1 = lbl1[0:3200]

tsne_data =model.fit_transform(data_3500)
#print(tsne_data.shape)
tsne_data = np.vstack((tsne_data.T, lbl1)).T

#print(lbl1.shape)
#print(data_3500.shape)
#print(tsne_data.shape )

tsne_df = pd.DataFrame(data=tsne_data, columns=("Dim_1", "Dim_2", "lbl1"))


sn.FacetGrid(tsne_df, hue="lbl1", size=6).map(plt.scatter, 'Dim_1', 'Dim_2').add_legend()
plt.show()

In [ ]:
#list_of_sent = list_of_sent[0:10000]
#list_of_sent.size

Average Word2Vec
In [43]:

sent_vectors = []; # the avg-w2v for each sentence/review is stored in this list
for sent in list_of_sent: # for each review/sentence
#print(sent)
sent_vec = np.zeros(30) # as word vectors are of zero length
cnt_words =0; # num of words with a valid vector in the sentence/review
for word in sent:# for each word in a review/sentence
try:
vec = w2v_model.wv[word]
sent_vec1 = sent_vec + vec
cnt_words += 1
except:
pass
sent_vec /= cnt_words
sent_vec /= cnt_words
sent_vectors.append(sent_vec)

In [51]:

from sklearn.manifold import TSNE


import seaborn as sn

model = TSNE(n_components = 2, random_state = 0)


#data_3500 = w2v[0:2700,:]
data_35 = sent_vectors[0:3200]

lbl1 = final['Score']
lbl1 = lbl1[0:3200]

tsne_data =model.fit_transform(data_35)
#print(tsne_data.shape)
tsne_data = np.vstack((tsne_data.T, lbl1)).T
print(tsne_data.shape)

tsne_df = pd.DataFrame(data=tsne_data, columns=("Dim_1", "Dim_2", "lbl1"))


sn.FacetGrid(tsne_df, hue="lbl1", size=6).map(plt.scatter, 'Dim_1', 'Dim_2').add_legend()
plt.show()

(3200, 3)

You might also like