Alexa Sentiment Analysis
Problem Statement:
Analyse Amazon Alexa product reviews and build an NLP model that predicts whether a review's sentiment (the 'feedback' label) is positive or negative.
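The setup cells (In [1]-In [3]) did not survive the export; below is a minimal sketch of the imports and data load the rest of the notebook relies on. The file name and the exact import list are assumptions.

In [1]: # Hedged reconstruction of the setup cells
import re
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from io import BytesIO
from sklearn.feature_extraction.text import CountVectorizer
from wordcloud import WordCloud, STOPWORDS
from nltk.stem.porter import PorterStemmer

data = pd.read_csv('amazon_alexa.tsv', sep='\t')  #file name is an assumption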
In [4]: data.head()
Out[4]:    rating  date       variation        verified_reviews                                   feedback
        0  5       31-Jul-18  Charcoal Fabric  Love my Echo!                                      1
        1  5       31-Jul-18  Charcoal Fabric  Loved it!                                          1
        3  5       31-Jul-18  Charcoal Fabric  "I have had a lot of fun with this thing. My 4...  1
        4  5       31-Jul-18  Charcoal Fabric  Music                                              1
In [6]: data.isnull().sum()
Out[6]: rating              0
        date                0
        variation           0
        verified_reviews    1
        feedback            0
        dtype: int64
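Cells In [7]-In [8] did not survive the export. Since apply(len) below would fail on the single NaN in 'verified_reviews', a hedged guess at the cleanup they performed:

In [7]: # Presumed cleanup (hedged reconstruction): drop the one record
# whose 'verified_reviews' is null so len() can be applied below
data = data.dropna(subset=['verified_reviews'])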
In [9]: data.shape
Out[9]: (3149, 5)
In [10]: #Creating a new column 'length' that will contain the length of the string in 'verified_reviews'
data['length'] = data['verified_reviews'].apply(len)
In [11]: data.head()
Out[11]:   rating  date       variation        verified_reviews                                   feedback  length
        0  5       31-Jul-18  Charcoal Fabric  Love my Echo!                                      1         13
        1  5       31-Jul-18  Charcoal Fabric  Loved it!                                          1         9
        3  5       31-Jul-18  Charcoal Fabric  "I have had a lot of fun with this thing. My 4...  1         174
        4  5       31-Jul-18  Charcoal Fabric  Music                                              1         5
The 'length' column is a newly generated column that stores the length of 'verified_reviews' for that record. Let's check it for a sample record:
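The cell that printed the check did not survive the export; a minimal sketch (the row position used is an assumption):

In [12]: #Verifying the 'length' column against a sample review (reconstructed)
sample = data.iloc[10]  #row position is an assumption; the original sampled some record
print(f"'verified_reviews' column value: \"{sample['verified_reviews']}\"")
print(f"Length of review : {len(sample['verified_reviews'])}")
print(f"'length' column value : {sample['length']}")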
'verified_reviews' column value: "I sent it to my 85 year old Dad, and he talks to it constantly."
Length of review : 65
'length' column value : 65
We can see that the length of the review is the same as the value in the 'length' column for that record.
In [13]: data.dtypes
In [14]: len(data)
Out[14]: 3149
In [15]: data['rating'].value_counts()
Out[15]: rating
5 2286
4 455
1 161
3 152
2 95
Name: count, dtype: int64
In [16]: data['rating'].value_counts().plot.bar(color = 'red')
plt.title('Rating distribution count')
plt.xlabel('Ratings')
plt.ylabel('Count')
plt.show()
In [17]: #Finding the percentage distribution of each rating - we'll divide the number of records for each rating by the total number of records
round(data['rating'].value_counts()/data.shape[0]*100,2)
Out[17]: rating
5    72.59
4    14.45
1     5.11
3     4.83
2     3.02
Name: count, dtype: float64
In [18]: #Pie chart of the rating distribution (the plot call was lost in the export; the call below is an assumption)
fig = plt.figure(figsize=(7,7))
wp = {'linewidth':1, "edgecolor":'black'}
tags = data['rating'].value_counts()/data.shape[0]
explode=(0.1,0.1,0.1,0.1,0.1)
tags.plot(kind='pie', autopct='%1.1f%%', wedgeprops=wp, explode=explode)  #reconstructed plot call
graph = BytesIO()
fig.savefig(graph, format="png")
In [19]: data['feedback'].value_counts()
Out[19]: feedback
1 2893
0 256
Name: count, dtype: int64
There are 2 distinct values of 'feedback' present - 0 and 1. Let's see what kind of review each
value corresponds to.
feedback value = 0
In [20]: #Extracting the 'verified_reviews' value for one record with feedback = 0
review_0 = data[data['feedback'] == 0].iloc[50]['verified_reviews']
review_0
In [21]: #Extracting the 'verified_reviews' value for one record with feedback = 1
review_1 = data[data['feedback'] == 1].iloc[0]['verified_reviews']  #reconstructed; the row position the original used is not shown
review_1

From the above 2 examples we can see that feedback 0 corresponds to a negative review and feedback 1 to a positive review.
In [22]: data['feedback'].value_counts().plot.bar(color = 'blue')
plt.title('Feedback distribution count')
plt.xlabel('Feedback')
plt.ylabel('Count')
plt.show()
In [23]: #Finding the percentage distribution of each feedback value - we'll divide the number of records for each feedback value by the total number of records
round(data['feedback'].value_counts()/data.shape[0]*100,2)
Out[23]: feedback
1    91.87
0     8.13
Name: count, dtype: float64
Feedback distribution
In [24]: #Pie chart of the feedback distribution (the plot call was lost in the export; the call below is an assumption)
wp = {'linewidth':1, "edgecolor":'black'}
tags = data['feedback'].value_counts()/data.shape[0]
explode=(0.1,0.1)
tags.plot(kind='pie', autopct='%1.1f%%', wedgeprops=wp, explode=explode)  #reconstructed plot call
plt.show()
In [25]: #Feedback = 0
data[data['feedback'] == 0]['rating'].value_counts()
Out[25]: rating
1 161
2 95
Name: count, dtype: int64
In [26]: #Feedback = 1
data[data['feedback'] == 1]['rating'].value_counts()
Out[26]: rating
5    2286
4     455
3     152
Name: count, dtype: int64

So feedback = 0 covers ratings 1 and 2, while feedback = 1 covers ratings 3, 4 and 5.
In [27]: data['variation'].value_counts()
Out[27]: variation
Black Dot 516
Charcoal Fabric 430
Configuration: Fire TV Stick 350
Black Plus 270
Black Show 265
Black 261
Black Spot 241
White Dot 184
Heather Gray Fabric 157
White Spot 109
Sandstone Fabric 90
White 90
White Show 85
White Plus 78
Oak Finish 14
Walnut Finish 9
Name: count, dtype: int64
In [28]: data['variation'].value_counts().plot.bar(color = 'orange')
plt.title('Variation distribution count')
plt.xlabel('Variation')
plt.ylabel('Count')
plt.show()
In [29]: #Finding the percentage distribution of each variation - we'll divide the number of records for each variation by the total number of records
round(data['variation'].value_counts()/data.shape[0]*100,2)
Out[29]: variation
Black Dot                       16.39
Charcoal Fabric                 13.66
Configuration: Fire TV Stick    11.11
Black Plus                       8.57
Black Show                       8.42
Black                            8.29
Black Spot                       7.65
White Dot                        5.84
Heather Gray Fabric              4.99
White Spot                       3.46
Sandstone Fabric                 2.86
White                            2.86
White Show                       2.70
White Plus                       2.48
Oak Finish                       0.44
Walnut Finish                    0.29
Name: count, dtype: float64
In [30]: data.groupby('variation')['rating'].mean()
Out[30]: variation
Black 4.233716
Black Dot 4.453488
Black Plus 4.370370
Black Show 4.490566
Black Spot 4.311203
Charcoal Fabric 4.730233
Configuration: Fire TV Stick 4.591429
Heather Gray Fabric 4.694268
Oak Finish 4.857143
Sandstone Fabric 4.355556
Walnut Finish 4.888889
White 4.166667
White Dot 4.423913
White Plus 4.358974
White Show 4.282353
White Spot 4.311927
Name: rating, dtype: float64
In [32]: data['length'].describe()
In [34]: sns.histplot(data[data['feedback']==0]['length'],color='red').set(title='Distribution of length of review if feedback = 0')
In [35]: sns.histplot(data[data['feedback']==1]['length'],color='green').set(title='Distribution of length of review if feedback = 1')
In [36]: cv = CountVectorizer(stop_words='english')
words = cv.fit_transform(data.verified_reviews)
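The cell that inspected the fitted matrix did not survive the export; a minimal sketch of how per-word totals can be pulled from it (the top-10 view is an assumption):

In [37]: # Hedged reconstruction: sum each column of the sparse document-term
# matrix to get a total count per vocabulary word
word_counts = pd.DataFrame({
    'word': cv.get_feature_names_out(),
    'count': words.toarray().sum(axis=0),
}).sort_values('count', ascending=False)
word_counts.head(10)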
In [38]: # Combine all reviews for each feedback category and split them into individual words
neg_reviews = " ".join([review for review in data[data['feedback'] == 0]['verified_reviews']])
neg_reviews = neg_reviews.lower().split()
pos_reviews = " ".join([review for review in data[data['feedback'] == 1]['verified_reviews']])  #reconstructed by symmetry with the negative case
pos_reviews = pos_reviews.lower().split()

#Finding words from reviews which are present in that feedback category only
unique_negative = [x for x in neg_reviews if x not in pos_reviews]
unique_negative = " ".join(unique_negative)
unique_positive = [x for x in pos_reviews if x not in neg_reviews]  #reconstructed by symmetry
unique_positive = " ".join(unique_positive)
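The word-cloud cells themselves were lost in the export; a minimal sketch of how the negative cloud described below can be drawn (all parameters are assumptions; the same call with unique_positive gives the positive cloud):

In [39]: # Hedged reconstruction of the word-cloud plot
wc = WordCloud(background_color='white', max_words=50)
plt.figure(figsize=(10,10))
plt.imshow(wc.generate(unique_negative), interpolation='bilinear')
plt.axis('off')
plt.show()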
Negative words can be seen in the above word cloud: garbage, pointless, poor, horrible, repair, etc.
Positive words can be seen in the above word cloud: good, enjoying, amazing, best, great, etc.
In [41]: corpus = []
stemmer = PorterStemmer()
for i in range(0, data.shape[0]):
    review = re.sub('[^a-zA-Z]', ' ', data.iloc[i]['verified_reviews'])
    review = review.lower().split()
    review = [stemmer.stem(word) for word in review if not word in STOPWORDS]
    review = ' '.join(review)
    corpus.append(review)
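For intuition, a tiny worked example of what one pass of this cleaning and stemming does (the input sentence is illustrative, not from the dataset):

# Illustrative input only; the printed result is PorterStemmer's output
sample = "Loved it! Works perfectly with my 2 Echo devices."
cleaned = re.sub('[^a-zA-Z]', ' ', sample).lower().split()
print(' '.join(stemmer.stem(w) for w in cleaned if w not in STOPWORDS))
# -> love work perfectli echo devic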
Splitting the data into train and test sets, with 30% of the data used for testing.
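The split cell was lost in the export; a minimal sketch under stated assumptions (max_features and random_state are assumptions; test_size=0.3 comes from the text, and cv_bow is a hypothetical name to avoid clobbering the earlier cv):

# Hedged reconstruction: bag-of-words features from the stemmed corpus, then a 70/30 split
from sklearn.model_selection import train_test_split

cv_bow = CountVectorizer(max_features=2500)  #max_features is an assumption
X = cv_bow.fit_transform(corpus).toarray()
y = data['feedback'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)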
from imblearn.over_sampling import RandomOverSampler

# Initialize RandomOverSampler to rebalance the roughly 92/8 feedback split
oversampler = RandomOverSampler(random_state=42)
# Perform oversampling on the training data
X_resampled, y_resampled = oversampler.fit_resample(X_train, y_train)
We'll scale X_train and X_test so that all values are between 0 and 1.
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()  #scaler definition reconstructed; MinMaxScaler matches the 0-to-1 range described above
X_train_scl = scaler.fit_transform(X_train)
X_test_scl = scaler.transform(X_test)
# K-Fold cross-validation
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
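The per-model loop was lost in the export; a hedged sketch of how the cross-validation numbers below could be produced (the model dictionary, its hyperparameters, and the choice of scaled training data are all assumptions):

# Hedged reconstruction of the cross-validation loop
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier

models = {
    'AdaBoost': AdaBoostClassifier(random_state=42),
    'Bagging': BaggingClassifier(random_state=42),
    'XGBoost': XGBClassifier(random_state=42),
    'SVM': SVC(random_state=42),
    'KNN': KNeighborsClassifier(),
}
for name, model in models.items():
    scores = cross_val_score(model, X_train_scl, y_train, cv=kfold, scoring='accuracy')
    print(f"{name} (Training):")
    print(f"  CrossVal_Score_Mean: {scores.mean():.4f}")
    print(f"  CrossVal_Error: {scores.std():.4f}")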
AdaBoost (Training):
CrossVal_Score_Mean: 0.9251
CrossVal_Error: 0.0086
Bagging (Training):
CrossVal_Score_Mean: 0.9269
CrossVal_Error: 0.0084
XGBoost (Training):
CrossVal_Score_Mean: 0.9283
CrossVal_Error: 0.0068
SVM (Training):
CrossVal_Score_Mean: 0.9220
CrossVal_Error: 0.0094
KNN (Training):
CrossVal_Score_Mean: 0.9192
CrossVal_Error: 0.0090
Confusion_Matrix:
[[ 29 49]
[ 10 857]]
Confusion_Matrix:
[[ 29 49]
[ 6 861]]
AdaBoost (Test):
Accuracy: 0.9270
F1_Score: 0.9610
ROC_AUC_Score: 0.8801
Classification_Report:
precision recall f1-score support
Confusion_Matrix:
[[ 25 53]
[ 16 851]]
Bagging (Test):
Accuracy: 0.9280
F1_Score: 0.9607
ROC_AUC_Score: 0.8836
Classification_Report:
precision recall f1-score support
Confusion_Matrix:
[[ 45 33]
[ 35 832]]
Confusion_Matrix:
[[ 36 42]
[ 16 851]]
Confusion_Matrix:
[[ 23 55]
[ 9 858]]
XGBoost (Test):
Accuracy: 0.9418
F1_Score: 0.9690
ROC_AUC_Score: 0.9111
Classification_Report:
precision recall f1-score support
Confusion_Matrix:
[[ 30 48]
[ 7 860]]
Confusion_Matrix:
[[ 44 34]
[ 40 827]]
SVM (Test):
Accuracy: 0.9270
F1_Score: 0.9617
ROC_AUC_Score: 0.8878
Classification_Report:
precision recall f1-score support
Confusion_Matrix:
[[ 9 69]
[ 0 867]]
KNN (Test):
Accuracy: 0.9132
F1_Score: 0.9546
ROC_AUC_Score: 0.7544
Classification_Report:
precision recall f1-score support
Confusion_Matrix:
[[ 0 78]
[ 4 863]]
Confusion_Matrix:
[[ 51 27]
[370 497]]
from sklearn.model_selection import GridSearchCV

# Initialize GridSearchCV (rf and param_grid come from a cell that did not survive
# the export; a RandomForestClassifier and its hyperparameter grid are assumed)
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid,
                           cv=5, n_jobs=-1, scoring='accuracy', verbose=2)
# Fit GridSearchCV
grid_search.fit(X_train, y_train)
In [53]: # Train the Random Forest model with the best parameters found by the grid search
best_params = grid_search.best_params_  #reconstructed; best_params was not defined in the surviving cells
best_rf = RandomForestClassifier(**best_params, random_state=42)
best_rf.fit(X_train, y_train)
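The metric computations feeding the prints below were also lost; a minimal sketch of the test-set half (using predict_proba for ROC AUC is an assumption):

# Hedged reconstruction of the test-set metrics printed below
from sklearn.metrics import (accuracy_score, f1_score, roc_auc_score,
                             classification_report, confusion_matrix)

y_test_pred = best_rf.predict(X_test)
test_accuracy = accuracy_score(y_test, y_test_pred)
test_f1 = f1_score(y_test, y_test_pred)
test_roc_auc = roc_auc_score(y_test, best_rf.predict_proba(X_test)[:, 1])
test_clf_report = classification_report(y_test, y_test_pred)
test_cm = confusion_matrix(y_test, y_test_pred)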
print("Test Metrics:")
print(f" Accuracy: {test_accuracy:.4f}")
print(f" F1_Score: {test_f1:.4f}")
print(f" ROC_AUC_Score: {test_roc_auc:.4f}")
print(f" Classification_Report:\n{test_clf_report}")
print(f" Confusion_Matrix:\n{test_cm}\n")
Training Metrics:
Accuracy: 0.9946
F1_Score: 0.9970
ROC_AUC_Score: 0.9985
Classification_Report:
precision recall f1-score support
Confusion_Matrix:
[[ 166 12]
[ 0 2026]]
Test Metrics:
Accuracy: 0.9429
F1_Score: 0.9696
ROC_AUC_Score: 0.9124
Classification_Report:
precision recall f1-score support
Confusion_Matrix:
[[ 30 48]
[ 6 861]]
Confusion Matrix
In [54]: # Print the Confusion Matrix and slice it into four pieces
cm = confusion_matrix(y_test, y_test_pred)
print('Confusion matrix')
print(cm)
tn, fp, fn, tp = cm.ravel()  #the "four pieces": true negatives, false positives, false negatives, true positives (reconstructed from the comment)

Confusion matrix
[[ 30  48]
 [  6 861]]
Classification Report
In [57]: from sklearn.metrics import classification_report
print(classification_report(y_test, y_test_pred))
Result:
• The tuned Random Forest model predicts review sentiment with about 94% accuracy on the test set.
Thank You!