Amazon Assignment Ex
Amazon Assignment Ex
Objective : Objective of this exercise is to perform the sentiment analysis on Amazon Fine Food Reviews.
Data Analysis : Amazon Fine Food Review is a large data set with around 568K reviews given by the customers on the Amazon food
products. Each review has got the following 10 features.
1. ID
2. ProductId - Unique Identifier on the product.
3. UserId - Unique Identifier for the user.
4. ProfileName
5. HelpfulnessNumerator - Number of users who found the review helpful.
6. HelpfulnessDenominator - Number of users who indicated whether the review they found helpful.
7. Score - Rating between 1 to 5.
8. Time - Timestamp for the review.
9. Summary - Brief summary of the review.
10. Text - Text of the review.
In [4]:
import os
os.getcwd()
os.chdir(r"C:\Users\Sujatha\Applied AI Course\Ramesh\Data")
os.getcwd()
Out[4]:
'C:\\Users\\Sujatha\\Applied AI Course\\Ramesh\\Data'
In [2]:
%matplotlib inline
import sqlite3
import pandas as pd
import numpy as np
import nltk
import string
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
In [3]:
con = sqlite3.connect('./amazon-fine-food-reviews/database.sqlite')
In [4]:
filtered_data.head()
Out[4]:
Good
0 1 B001E4KFG0 A3SGXH7AUHU8GW delmartian 1 1 5 1303862400 Quality
Dog Food
Not as
1 2 B00813GRG4 A1D87F6ZCVE5NK dll pa 0 0 1 1346976000
Advertised
Natalia
Corres "Delight"
2 3 B000LQOCH0 ABXLMWJIXXAIN 1 1 4 1219017600
"Natalia says it all
Corres"
Cough
3 4 B000UA0QIQ A395BORC6FGVXV Karl 3 3 2 1307923200
Medicine
Michael D.
4 5 B006K2ZZ7K A1UQRSCLF8GW1T Bigham "M. 0 0 5 1350777600 Great taffy
Wassir"
I limited my data to be 3500 reviews only as my machine configuration is small and is not supporting for large data.
In [5]:
filtered_data = filtered_data[0:3500]
filtered_data.shape
Out[5]:
(3500, 10)
The Score above 3 is taken as Positive and below 3 is taken as negative and equal to 3 is neglected.
In [6]:
def partition(y):
if y > 3:
return "postive"
return "negative"
In [7]:
actualscore = filtered_data["Score"]
positivenegative = actualscore.map(partition)
filtered_data['Score'] = positivenegative
In [8]:
filtered_data.shape
Out[8]:
(3500, 10)
In [9]:
filtered_data.head()
Out[9]:
Good
0 1 B001E4KFG0 A3SGXH7AUHU8GW delmartian 1 1 postive 1303862400 Quality
Dog Foo
Not as
1 2 B00813GRG4 A1D87F6ZCVE5NK dll pa 0 0 negative 1346976000
Advertis
Natalia
Corres "Delight"
2 3 B000LQOCH0 ABXLMWJIXXAIN 1 1 postive 1219017600
"Natalia says it a
Corres"
Cough
3 4 B000UA0QIQ A395BORC6FGVXV Karl 3 3 negative 1307923200
Medicine
Michael D.
4 5 B006K2ZZ7K A1UQRSCLF8GW1T Bigham "M. 0 0 postive 1350777600 Great ta
Wassir"
In [10]:
display = pd.read_sql_query("""SELECT * FROM Reviews where score!=3 and UserId like '%AR5J8U%' ord
er by ProductID""",con)
In [11]:
display.head()
Out[11]:
Geetha
0 78445 B000HDL1RQ AR5J8UI46CURR 2 2 5 1199577600
Krishnan
Geetha
1 138317 B000HDOPYC AR5J8UI46CURR 2 2 5 1199577600
Krishnan
Krishnan
Id ProductId UserId ProfileName HelpfulnessNumerator HelpfulnessDenominator Score Time
Geetha
2 138277 B000HDOPYM AR5J8UI46CURR 2 2 5 1199577600
Krishnan
Geetha
3 73791 B000HDOPZG AR5J8UI46CURR 2 2 5 1199577600
Krishnan
Geetha
4 155049 B000PAQ75C AR5J8UI46CURR 2 2 5 1199577600
Krishnan
In [12]:
sorted_data = filtered_data.sort_values("ProductId",axis = 0,ascending = True)
In [13]:
#Duplication of entries.
final = sorted_data.drop_duplicates(subset = {"UserId","ProfileName","Time",'Text'}, keep = "first"
,inplace = False)
In [14]:
final.shape
Out[14]:
(3491, 10)
In [15]:
(final['Id'].size*1.0)/(filtered_data['Id'].size*1.0)
Out[15]:
0.9974285714285714
In [16]:
Out[16]:
(3491, 10)
In [17]:
final['Score'].value_counts()
Out[17]:
postive 2909
negative 582
Name: Score, dtype: int64
In [18]:
count_vect = CountVectorizer()
final_counts = count_vect.fit_transform(final['Text'].values)
In [19]:
final_counts.shape
Out[19]:
(3491, 11010)
In [20]:
import re
i=0
for sent in final['Text'].values:
if(len(re.findall("<.*?",sent))):
print(i)
print(sent)
break;
i+=1
0
Why is this $[...] when the same product is available for $[...] here?<br
/>http://www.amazon.com/VICTOR-FLY-MAGNET-BAIT-REFILL/dp/B00004RBDY<br /><br />The Victor M380 and
M502 traps are unreal, of course -- total fly genocide. Pretty stinky, but only right nearby.
In [21]:
import nltk
nltk.download('stopwords')
Out[21]:
True
In [22]:
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
print(stop)
{'their', 'own', 'just', "wasn't", 'hadn', 'has', 'each', 'i', 'by', "haven't", 'again',
'yourselves', "hasn't", "she's", 's', 'why', "should've", 'will', 'ma', 'into', 'him', 'its', 'are
', 'off', 'd', 'his', 'where', 'mustn', 'didn', 'an', 'both', "that'll", 'doesn', 'such', 'were',
'with', 'ain', "aren't", 'do', 'himself', 'being', 'is', 'any', 'our', 'was', 'on', 'needn', 'for'
, 'which', 'these', 'few', 'between', "wouldn't", 'ours', 'all', 'they', 'or', 'very', 'if',
'now', 'doing', 'other', 'after', 'won', "weren't", "you've", 'ourselves', 'have', 'yours', 'then'
, 'whom', "mightn't", 'he', 'hers', "hadn't", 'what', 'she', 'those', 'yourself', 'in', 'but', 'as
', 'above', 'while', 'below', "you'd", 'from', 'most', 'up', 'me', 'when', 'themselves', 'wasn', "
didn't", 'myself', 'be', 'her', 't', 'because', 'itself', 'o', 'your', 're', 'haven', 'how',
'not', 'shan', 'this', 'can', 'same', 'm', 'theirs', 'been', 'more', 'some', 'shouldn', 'my',
'during', "shouldn't", 'so', "couldn't", 'we', 'mightn', 'hasn', 'through', 'wouldn', 'until', "yo
u're", "you'll", "it's", 'out', 'should', 'that', 'only', 'weren', 'a', 'once', "mustn't", 'aren',
'having', "needn't", 'it', 'herself', 've', "shan't", 'before', 'who', "won't", "isn't", 'and', 't
he', 'to', 'nor', 'than', 'against', 'under', 'over', 'had', 'y', 'them', 'll', 'at', 'about', 'th
ere', 'don', 'am', 'you', "doesn't", 'further', 'no', 'down', 'too', 'isn', 'here', 'couldn',
'of', 'did', "don't", 'does'}
In [23]:
def cleanhtml(sentence):
cleanr = re.compile('<.*?')
cleantext = re.sub(cleanr, ' ', sentence)
return cleantext
def cleanpunc(sentence):
cleaned = re.sub(r'[?|!|\|"|#]',' ',sentence)
cleaned = re.sub(r'[.|,|)|(|\|/]',' ',cleaned)
return cleaned
print(sno.stem('tasty'))
tasti
In [24]:
i=0
strl = ' '
final_string = []
all_positive_words = [ ]
all_negative_words = [ ]
s = ' '
for sent in final['Text'].values:
filtered_sentence = []
sent = cleanhtml(sent)
for w in sent.split():
for cleaned_words in cleanpunc(w).split():
if(((cleaned_words.isalpha()& (len(cleaned_words)>2)))):
if(cleaned_words.lower() not in stop):
s = (sno.stem((cleaned_words.lower()))).encode('utf8')
filtered_sentence.append(s)
if(final['Score'].values)[i]== 'positive':
all_positive_words.append(s)
if(final['Score'].values)[i]== 'negative':
all_negative_words.append(s)
else:
continue
else:
continue
strl = b" ".join(filtered_sentence)
final_string.append(strl)
i+=1
In [25]:
final['CleanedText'] = final_string
final.head()
Out[25]:
Glenna E.
Bauer
2941 3203 B000084DVR A3DKGXWUEP1AI2 3 3 postive 1163030400
"Puppy
Mum"
In [26]:
con = sqlite3.connect('final_sqlite')
c = con.cursor()
con.text_factory = str
final.to_sql('Reviews',con, if_exists='replace', index=True, index_label=None, chunksize=None,
dtype=None)
Bigram
In [27]:
#Bigrams
count_vect = CountVectorizer(ngram_range = (1,2))
final_bigram_counts = count_vect.fit_transform(final['Text'].values)
final_bigram_counts.get_shape()
Out[27]:
(3491, 111023)
In [30]:
#t-SNE for Bigrams data.
from sklearn.manifold import TSNE
import seaborn as sn
#TF IDF
Out[31]:
(3491, 111023)
In [32]:
import gensim
i= 0
list_of_sent = []
for sent in final['Text']:
filtered_sentence = []
sent = cleanhtml(sent)
for w in sent.split():
for cleaned_words in cleanpunc(w).split():
if(cleaned_words.isalpha()):
filtered_sentence.append(cleaned_words.lower())
else:
continue
list_of_sent.append(filtered_sentence)
print(final['Text'].values[0])
print(list_of_sent[0])
Why is this $[...] when the same product is available for $[...] here?<br
/>http://www.amazon.com/VICTOR-FLY-MAGNET-BAIT-REFILL/dp/B00004RBDY<br /><br />The Victor M380 and
M502 traps are unreal, of course -- total fly genocide. Pretty stinky, but only right nearby.
['why', 'is', 'this', 'when', 'the', 'same', 'product', 'is', 'available', 'for', 'here', 'br',
'www', 'amazon', 'com', 'dp', 'br', 'br', 'victor', 'and', 'traps', 'are', 'unreal', 'of',
'course', 'total', 'fly', 'genocide', 'pretty', 'stinky', 'but', 'only', 'right', 'nearby']
Word2Vec
In [34]:
Out[34]:
<gensim.models.word2vec.Word2Vec at 0x1e5e6dd5b38>
In [35]:
words=list(w2v_model.wv.vocab)
print(len(words))
10307
In [36]:
from sklearn.manifold import TSNE
import seaborn as sn
#words1 = gensim.W2VTransformer(final['Text'])
model = TSNE(n_components = 2, random_state = 0)
count_vect = CountVectorizer()
w2v = count_vect.fit_transform(words)
#w2v.get_shape()
#data_3500 = w2v[0:50,:]
#data_3500 = data_3500.toarray()
w2v.shape
Out[36]:
(10307, 10281)
In [42]:
In [42]:
#t-SNE for Word2Vec data.
from sklearn.manifold import TSNE
import seaborn as sn
lbl1 = final['Score']
lbl1 = lbl1[0:3200]
tsne_data =model.fit_transform(data_3500)
#print(tsne_data.shape)
tsne_data = np.vstack((tsne_data.T, lbl1)).T
#print(lbl1.shape)
#print(data_3500.shape)
#print(tsne_data.shape )
In [ ]:
#list_of_sent = list_of_sent[0:10000]
#list_of_sent.size
Average Word2Vec
In [43]:
sent_vectors = []; # the avg-w2v for each sentence/review is stored in this list
for sent in list_of_sent: # for each review/sentence
#print(sent)
sent_vec = np.zeros(30) # as word vectors are of zero length
cnt_words =0; # num of words with a valid vector in the sentence/review
for word in sent:# for each word in a review/sentence
try:
vec = w2v_model.wv[word]
sent_vec1 = sent_vec + vec
cnt_words += 1
except:
pass
sent_vec /= cnt_words
sent_vec /= cnt_words
sent_vectors.append(sent_vec)
In [51]:
lbl1 = final['Score']
lbl1 = lbl1[0:3200]
tsne_data =model.fit_transform(data_35)
#print(tsne_data.shape)
tsne_data = np.vstack((tsne_data.T, lbl1)).T
print(tsne_data.shape)
(3200, 3)