Alexa Sentiment Analysis

Sentimental Analysis, an application of NLP

Uploaded by

freeguyfreeguy67

Amazon Alexa Review – Sentiment Analysis

Problem Statement:

• Analyze the Amazon Alexa dataset and build classification models that predict
whether the sentiment of a given input sentence is positive or negative.
8/25/24, 11:10 AM Notebook

Amazon Alexa Review - Sentiment Analysis


Analyzing the Amazon Alexa dataset and building classification models to predict if the
sentiment of a given input sentence is positive or negative.

1. Importing Required Libraries


In [2]: import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
from nltk.stem.porter import PorterStemmer
nltk.download('stopwords')
from nltk.corpus import stopwords
STOPWORDS = set(stopwords.words('english'))

from sklearn.model_selection import train_test_split


from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_extraction.text import CountVectorizer
from wordcloud import WordCloud
import pickle
import re
import warnings
warnings.filterwarnings('ignore')

[nltk_data] Downloading package stopwords to


[nltk_data] C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data] Package stopwords is already up-to-date!

2. Exploratory Data Analysis


In [3]: #Load the data

data = pd.read_csv("amazon_alexa.tsv", delimiter = '\t', quoting = 3)

print(f"Dataset shape : {data.shape}")

Dataset shape : (3150, 5)

In [4]: data.head()

file:///C:/Users/ASUS/Downloads/Amazon_Alexa_Review_Sentiment_Analysis_NLP.html 1/32

Out[4]:
   rating  date       variation        verified_reviews                                   feedback
0  5       31-Jul-18  Charcoal Fabric  Love my Echo!                                      1
1  5       31-Jul-18  Charcoal Fabric  Loved it!                                          1
2  4       31-Jul-18  Walnut Finish    "Sometimes while playing a game, you can answe...  1
3  5       31-Jul-18  Charcoal Fabric  "I have had a lot of fun with this thing. My 4...  1
4  5       31-Jul-18  Charcoal Fabric  Music                                              1

In [5]: #Column names


data.columns

Out[5]: Index(['rating', 'date', 'variation', 'verified_reviews', 'feedback'], dtype='object')

In [6]: #Check for null values


data.isnull().sum()

Out[6]: rating 0
date 0
variation 0
verified_reviews 1
feedback 0
dtype: int64

There is one record with no 'verified_reviews' (null value)

In [7]: #Getting the record where 'verified_reviews' is null


data[data['verified_reviews'].isna() == True]

Out[7]: rating date variation verified_reviews feedback

473 2 29-Jun-18 White NaN 0

In [8]: #We will drop the null record


data.dropna(inplace=True)

In [9]: data.shape

Out[9]: (3149, 5)

In [10]: #Creating a new column 'length' that will contain the length of the string in 'veri
data['length'] = data['verified_reviews'].apply(len)

In [11]: data.head()


Out[11]:
   rating  date       variation        verified_reviews                                   feedback  length
0  5       31-Jul-18  Charcoal Fabric  Love my Echo!                                      1         13
1  5       31-Jul-18  Charcoal Fabric  Loved it!                                          1         9
2  4       31-Jul-18  Walnut Finish    "Sometimes while playing a game, you can answe...  1         197
3  5       31-Jul-18  Charcoal Fabric  "I have had a lot of fun with this thing. My 4...  1         174
4  5       31-Jul-18  Charcoal Fabric  Music                                              1         5

The 'length' column is a newly generated column that stores the length of 'verified_reviews' for
each record. Let's check some sample records.

In [12]: #Randomly checking for 10th record

print(f"'verified_reviews' column value: {data.iloc[10]['verified_reviews']}") #Ori


print(f"Length of review : {len(data.iloc[10]['verified_reviews'])}") #Length of re
print(f"'length' column value : {data.iloc[10]['length']}") #Value of the column 'l

'verified_reviews' column value: "I sent it to my 85 year old Dad, and he talks to it constantly."
Length of review : 65
'length' column value : 65

We can see that the length of review is the same as the value in the length column for that
record

Datatypes of the features

In [13]: data.dtypes

Out[13]: rating int64


date object
variation object
verified_reviews object
feedback int64
length int64
dtype: object

rating, feedback and length are integer values


date, variation and verified_reviews are string values

Analyzing 'rating' column


This column refers to the rating of the variation given by the user


In [14]: len(data)

Out[14]: 3149

In [15]: #Distinct values of 'rating' and its count


data['rating'].value_counts()

Out[15]: rating
5 2286
4 455
1 161
3 152
2 95
Name: count, dtype: int64

Let's plot the above values in a bar graph

In [16]: #Bar plot to visualize the total counts of each rating

data['rating'].value_counts().plot.bar(color = 'red')
plt.title('Rating distribution count')
plt.xlabel('Ratings')
plt.ylabel('Count')
plt.show()

In [17]: #Finding the percentage distribution of each rating - we'll divide the number of re
round(data['rating'].value_counts()/data.shape[0]*100, 2)

Out[17]: rating
5    72.59
4    14.45
1     5.11
3     4.83
2     3.02
Name: count, dtype: float64

Let's plot the above values in a pie chart

In [18]: fig = plt.figure(figsize=(7,7))

colors = ('red', 'green', 'blue','orange','yellow')

wp = {'linewidth':1, "edgecolor":'black'}

tags = data['rating'].value_counts()/data.shape[0]

explode=(0.1,0.1,0.1,0.1,0.1)

tags.plot(kind='pie', autopct="%1.1f%%", shadow=True, colors=colors, startangle=90,

from io import BytesIO

graph = BytesIO()

fig.savefig(graph, format="png")


Analyzing 'feedback' column


This column refers to the feedback of the verified review

In [19]: #Distinct values of 'feedback' and its count


data['feedback'].value_counts()

Out[19]: feedback
1 2893
0 256
Name: count, dtype: int64

There are 2 distinct values of 'feedback' present - 0 and 1. Let's see what kind of review each
value corresponds to.

feedback value = 0

In [20]: #Extracting the 'verified_reviews' value for one record with feedback = 0
review_0 = data[data['feedback'] == 0].iloc[50]['verified_reviews']
review_0


Out[20]: 'Extremely low in volume'

In [21]: #Extracting the 'verified_reviews' value for one record with feedback = 1

review_1 = data[data['feedback'] == 1].iloc[8]['verified_reviews']


review_1

Out[21]: 'looks great'

From the above two examples we can see that feedback 0 marks a negative review and feedback 1
a positive review.

Let's plot the feedback value count in a bar graph

In [22]: #Bar graph to visualize the total counts of each feedback

data['feedback'].value_counts().plot.bar(color = 'blue')
plt.title('Feedback distribution count')
plt.xlabel('Feedback')
plt.ylabel('Count')
plt.show()

In [23]: #Finding the percentage distribution of each feedback - we'll divide the number of
round(data['feedback'].value_counts()/data.shape[0]*100, 2)

Out[23]: feedback
1    91.87
0     8.13
Name: count, dtype: float64

Feedback distribution

91.87% reviews are positive


8.13% reviews are negative
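
The percentages above follow directly from the counts in Out[19]. A small dependency-free sketch reproduces them; the helper name `class_shares` is ours for illustration and is not part of the notebook.

```python
# Counts copied from Out[19]; class_shares is a hypothetical helper for illustration.
feedback_counts = {1: 2893, 0: 256}

def class_shares(counts):
    """Return each class's share of the total, in percent, rounded to 2 decimals."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 2) for label, n in counts.items()}

shares = class_shares(feedback_counts)
print(shares)
```

With roughly eleven positive reviews for every negative one, a model can score high accuracy while rarely predicting the negative class, so per-class precision and recall are worth watching alongside accuracy.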

In [24]: fig = plt.figure(figsize=(7,7))

colors = ('red', 'green')

wp = {'linewidth':1, "edgecolor":'black'}

tags = data['feedback'].value_counts()/data.shape[0]

explode=(0.1,0.1)

tags.plot(kind='pie', autopct="%1.1f%%", shadow=True, colors=colors, startangle=90,

Out[24]: <Axes: ylabel='Percentage wise distrubution of feedback'>


Let's see the 'rating' values for different values of 'feedback'

In [25]: #Feedback = 0
data[data['feedback'] == 0]['rating'].value_counts()

Out[25]: rating
1 161
2 95
Name: count, dtype: int64

In [26]: #Feedback = 1
data[data['feedback'] == 1]['rating'].value_counts()

Out[26]: rating
5 2286
4 455
3 152
Name: count, dtype: int64

If the rating of a review is 1 or 2, the feedback is 0 (negative); if the rating is 3, 4 or 5,
the feedback is 1 (positive).
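
The rule can be written as a one-line mapping and checked against the counts already shown. This is a sketch under the assumption that the rule holds for every record, which Out[25] and Out[26] suggest; `feedback_from_rating` is a hypothetical helper name, not something defined in the notebook.

```python
def feedback_from_rating(rating: int) -> int:
    """Map a 1-5 star rating to the binary feedback label observed in the dataset."""
    if rating not in (1, 2, 3, 4, 5):
        raise ValueError(f"unexpected rating: {rating}")
    return 0 if rating <= 2 else 1

# Counts copied from Out[15]; the aggregated totals should match Out[19].
rating_counts = {5: 2286, 4: 455, 1: 161, 3: 152, 2: 95}
totals = {0: 0, 1: 0}
for rating, n in rating_counts.items():
    totals[feedback_from_rating(rating)] += n
print(totals)
```

Because 'feedback' is fully determined by 'rating', the rating column should not be used as a model input when predicting feedback.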


Analyzing 'variation' column


This column refers to the variation or type of Amazon Alexa product. Example - Black Dot,
Charcoal Fabric etc.

In [27]: #Distinct values of 'variation' and its count


data['variation'].value_counts()

Out[27]: variation
Black Dot 516
Charcoal Fabric 430
Configuration: Fire TV Stick 350
Black Plus 270
Black Show 265
Black 261
Black Spot 241
White Dot 184
Heather Gray Fabric 157
White Spot 109
Sandstone Fabric 90
White 90
White Show 85
White Plus 78
Oak Finish 14
Walnut Finish 9
Name: count, dtype: int64

In [28]: #Bar graph to visualize the total counts of each variation

data['variation'].value_counts().plot.bar(color = 'orange')
plt.title('Variation distribution count')
plt.xlabel('Variation')
plt.ylabel('Count')
plt.show()


In [29]: #Finding the percentage distribution of each variation - we'll divide the number of
round(data['variation'].value_counts()/data.shape[0]*100, 2)

Out[29]: variation
Black Dot                       16.39
Charcoal Fabric                 13.66
Configuration: Fire TV Stick    11.11
Black Plus                       8.57
Black Show                       8.42
Black                            8.29
Black Spot                       7.65
White Dot                        5.84
Heather Gray Fabric              4.99
White Spot                       3.46
Sandstone Fabric                 2.86
White                            2.86
White Show                       2.70
White Plus                       2.48
Oak Finish                       0.44
Walnut Finish                    0.29
Name: count, dtype: float64

Mean rating according to variation

In [30]: data.groupby('variation')['rating'].mean()

Out[30]: variation
Black 4.233716
Black Dot 4.453488
Black Plus 4.370370
Black Show 4.490566
Black Spot 4.311203
Charcoal Fabric 4.730233
Configuration: Fire TV Stick 4.591429
Heather Gray Fabric 4.694268
Oak Finish 4.857143
Sandstone Fabric 4.355556
Walnut Finish 4.888889
White 4.166667
White Dot 4.423913
White Plus 4.358974
White Show 4.282353
White Spot 4.311927
Name: rating, dtype: float64

Let's analyze the above ratings

In [31]: data.groupby('variation')['rating'].mean().sort_values().plot.bar(color = 'brown',


plt.title("Mean rating according to variation")
plt.xlabel('Variation')
plt.ylabel('Mean rating')
plt.show()


Analyzing 'verified_reviews' column


This column contains the textual review given by the user for a variation for the product.

In [32]: data['length'].describe()

Out[32]: count 3149.000000


mean 132.714513
std 182.541531
min 1.000000
25% 30.000000
50% 74.000000
75% 166.000000
max 2853.000000
Name: length, dtype: float64

Length analysis for full dataset

In [33]: sns.histplot(data['length'],color='blue').set(title='Distribution of length of revi

Out[33]: [Text(0.5, 1.0, 'Distribution of length of review ')]


Length analysis when feedback is 0 (negative)

In [34]: sns.histplot(data[data['feedback']==0]['length'],color='red').set(title='Distributi

Out[34]: [Text(0.5, 1.0, 'Distribution of length of review if feedback = 0')]


Length analysis when feedback is 1 (positive)

In [35]: sns.histplot(data[data['feedback']==1]['length'],color='green').set(title='Distribu

Out[35]: [Text(0.5, 1.0, 'Distribution of length of review if feedback = 1')]


Lengthwise mean rating

In [36]: data.groupby('length')['rating'].mean().plot.hist(color = 'blue', figsize=(7, 6), b


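plt.title(" Review length wise mean ratings")
plt.xlabel('Mean rating')  # the histogram bins the per-length mean ratings
plt.ylabel('Number of distinct review lengths')  # the y-axis is a frequency, not a length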
plt.show()


In [36]: cv = CountVectorizer(stop_words='english')
words = cv.fit_transform(data.verified_reviews)

In [37]: # Combine all reviews


reviews = " ".join([review for review in data['verified_reviews']])

# Initialize wordcloud object


wc = WordCloud(background_color='white', max_words=50)

# Generate and plot wordcloud


plt.figure(figsize=(10,10))
plt.imshow(wc.generate(reviews))
plt.title('Wordcloud for all reviews', fontsize=10)
plt.axis('off')
plt.show()


Let's find the unique words in each feedback category

In [38]: # Combine all reviews for each feedback category and splitting them into individual
neg_reviews = " ".join([review for review in data[data['feedback'] == 0]['verified_
neg_reviews = neg_reviews.lower().split()

pos_reviews = " ".join([review for review in data[data['feedback'] == 1]['verified_


pos_reviews = pos_reviews.lower().split()

#Finding words from reviews which are present in that feedback category only
#(sets make the membership tests fast; the resulting strings are unchanged)
pos_vocab, neg_vocab = set(pos_reviews), set(neg_reviews)
unique_negative = " ".join(x for x in neg_reviews if x not in pos_vocab)
unique_positive = " ".join(x for x in pos_reviews if x not in neg_vocab)

In [39]: wc = WordCloud(background_color='white', max_words=50)

# Generate and plot wordcloud


plt.figure(figsize=(10,10))
plt.imshow(wc.generate(unique_negative))
plt.title('Wordcloud for negative reviews', fontsize=10)
plt.axis('off')
plt.show()


Negative words can be seen in the above word cloud - garbage, pointless, poor, horrible,
repair etc

In [40]: wc = WordCloud(background_color='white', max_words=50)

# Generate and plot wordcloud


plt.figure(figsize=(10,10))
plt.imshow(wc.generate(unique_positive))
plt.title('Wordcloud for positive reviews', fontsize=10)
plt.axis('off')
plt.show()

Positive words can be seen in the above word cloud - good, enjoying, amazing, best, great
etc


3. Preprocessing and Modelling


To build the corpus from the 'verified_reviews' column, we perform the following steps -

1. Replace any non-alphabetic characters with a space
2. Convert to lower case and split into words
3. Iterate over the individual words and, if a word is not a stopword, add its stemmed form
to the corpus

In [41]: corpus = []
stemmer = PorterStemmer()
for i in range(0, data.shape[0]):
    review = re.sub('[^a-zA-Z]', ' ', data.iloc[i]['verified_reviews'])
    review = review.lower().split()
    review = [stemmer.stem(word) for word in review if word not in STOPWORDS]
    review = ' '.join(review)
    corpus.append(review)
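
To see what the cleaning steps do to a single review, here is a dependency-free sketch applied to the sample review shown earlier. It covers only the character replacement, lowercasing and stop-word filtering; stemming is left out so the example needs no NLTK data. `DEMO_STOPWORDS` and `clean_review` are hypothetical names, and the tiny stop-word set is an assumption standing in for NLTK's full English list.

```python
import re

# Tiny illustrative stop-word set; the notebook uses NLTK's full English list instead.
DEMO_STOPWORDS = {"i", "it", "to", "my", "and", "he", "the", "a"}

def clean_review(text, stopwords=DEMO_STOPWORDS):
    """Replace non-letters with spaces, lowercase, split, and drop stop words."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    tokens = letters_only.lower().split()
    return [tok for tok in tokens if tok not in stopwords]

sample = '"I sent it to my 85 year old Dad, and he talks to it constantly."'
print(clean_review(sample))
```

Note that the digits in "85" disappear along with the punctuation, since every non-alphabetic character is replaced by a space.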

Using Count Vectorizer to create bag of words

In [42]: cv = CountVectorizer(max_features = 2500)

#Storing independent and dependent variables in X and y


X = cv.fit_transform(corpus).toarray()
y = data['feedback'].values
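
The bag-of-words representation can be illustrated without scikit-learn. The sketch below is a simplified stand-in, not CountVectorizer itself: it splits on whitespace only, and its tie-breaking when `max_features` cuts the vocabulary may differ from the library's. The function name `bag_of_words` and the three toy documents are assumptions for illustration.

```python
from collections import Counter

def bag_of_words(docs, max_features=None):
    """Return (vocabulary, matrix) where matrix[i][j] counts vocabulary[j] in docs[i]."""
    tokenized = [doc.split() for doc in docs]
    totals = Counter(tok for doc in tokenized for tok in doc)
    # Keep the most frequent terms, then sort alphabetically for a stable column order.
    kept = [term for term, _ in totals.most_common(max_features)]
    vocabulary = sorted(kept)
    matrix = [[Counter(doc)[term] for term in vocabulary] for doc in tokenized]
    return vocabulary, matrix

docs = ["love echo", "love love music", "echo sound poor"]
vocab, counts_matrix = bag_of_words(docs)
print(vocab)
print(counts_matrix)
```

Each row is one review and each column one vocabulary term, which is the same layout as the (3149, 2500) array X built above.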

In [43]: #Saving the Count Vectorizer


pickle.dump(cv, open('countVectorizer.pkl', 'wb'))

Checking the shape of X and y

In [44]: print(f"X shape: {X.shape}")


print(f"y shape: {y.shape}")

X shape: (3149, 2500)


y shape: (3149,)

Splitting data into train and test set with 30% data with testing.

In [45]: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_s

print(f"X train: {X_train.shape}")


print(f"y train: {y_train.shape}")
print(f"X test: {X_test.shape}")
print(f"y test: {y_test.shape}")

X train: (2204, 2500)


y train: (2204,)
X test: (945, 2500)
y test: (945,)
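
The split sizes can be checked by hand. As far as we understand scikit-learn's behaviour, a fractional `test_size` is rounded up to a whole number of rows and the remainder goes to the training set; the sketch below assumes exactly that.

```python
import math

n_samples = 3149   # rows left after dropping the null review
test_size = 0.3

n_test = math.ceil(test_size * n_samples)   # fractional test count is rounded up
n_train = n_samples - n_test
print(n_train, n_test)
```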


In [46]: print(f"X train max value: {X_train.max()}")


print(f"X test max value: {X_test.max()}")

X train max value: 12


X test max value: 10

In [47]: #Class Imbalance by Sampling Technique

from sklearn.datasets import make_classification


from collections import Counter
from imblearn.over_sampling import RandomOverSampler

# Check class distribution before oversampling


print("Before oversampling:", Counter(y_train))

# Initialize RandomOverSampler
oversampler = RandomOverSampler(random_state=42)

# Perform oversampling
X_resampled, y_resampled = oversampler.fit_resample(X_train, y_train)

# Check class distribution after oversampling


print("After oversampling:", Counter(y_resampled))

Before oversampling: Counter({1: 2026, 0: 178})


After oversampling: Counter({1: 2026, 0: 2026})
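
Random oversampling simply duplicates minority-class rows until the classes are balanced. The following is a minimal standard-library sketch of that idea, not imblearn's implementation; `random_oversample` and the toy data are hypothetical.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=42):
    """Duplicate randomly chosen minority-class rows until all classes are equally frequent."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        rows = [row for row, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out

rows_demo = [[1, 0], [0, 1], [1, 1], [2, 0], [0, 2]]
labels_demo = [1, 1, 1, 1, 0]
rows_bal, labels_bal = random_oversample(rows_demo, labels_demo)
print(Counter(labels_bal))
```

In the cells as exported, the later training calls use X_train and y_train, so X_resampled and y_resampled would need to be passed explicitly for the oversampling to affect the models.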

We'll scale X_train so that all of its values lie between 0 and 1, and apply the same transformation to X_test.

In [48]: scaler = MinMaxScaler()

X_train_scl = scaler.fit_transform(X_train)
X_test_scl = scaler.transform(X_test)
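
Min-max scaling maps each feature to (x - min) / (max - min), with min and max learned from the training data only. The sketch below is a simplified stand-in for MinMaxScaler; the helper names are hypothetical, and mapping constant columns to 0 is a simplification of ours.

```python
def fit_min_max(rows):
    """Learn the per-column minimum and maximum from the training rows."""
    cols = list(zip(*rows))
    return [min(c) for c in cols], [max(c) for c in cols]

def apply_min_max(rows, mins, maxs):
    """Scale each value to (x - min) / (max - min); constant columns map to 0."""
    return [
        [(x - lo) / (hi - lo) if hi > lo else 0.0 for x, lo, hi in zip(row, mins, maxs)]
        for row in rows
    ]

train_demo = [[0, 2], [4, 2], [2, 2]]
test_demo = [[1, 2], [6, 2]]
mins, maxs = fit_min_max(train_demo)
train_scaled = apply_min_max(train_demo, mins, maxs)
test_scaled = apply_min_max(test_demo, mins, maxs)
print(train_scaled)
print(test_scaled)
```

Test values larger than the training maximum scale to more than 1, so only the training set is guaranteed to stay inside [0, 1]. Also note that the exported training cells fit on X_train rather than X_train_scl.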

In [49]: #Saving the scaler model


pickle.dump(scaler, open('scaler.pkl', 'wb'))

Using all models for train and test


In [50]: from sklearn.ensemble import RandomForestClassifier ,AdaBoostClassifier,BaggingClas
from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold,
from sklearn.metrics import roc_curve,accuracy_score,f1_score,auc,confusion_matrix,
from xgboost.sklearn import XGBClassifier
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, classification
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, BaggingCla
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC


from sklearn.neighbors import KNeighborsClassifier


from sklearn.naive_bayes import GaussianNB

In [51]: # List of classifiers
classifiers = {
    'Logistic Regression': LogisticRegression(),
    'Random Forest': RandomForestClassifier(),
    'AdaBoost': AdaBoostClassifier(), ## BASIC BUT FAST BOOSTING TECHNIQUE
    'Bagging': BaggingClassifier(),
    'Extra Trees': ExtraTreesClassifier(), ## TREE BASED CLASSIFIER
    'Gradient Boosting': GradientBoostingClassifier(),
    'XGBoost': XGBClassifier(use_label_encoder=False, eval_metric='logloss'),
    'Decision Tree': DecisionTreeClassifier(),
    'SVM': SVC(probability=True), ## SUPPORT VECTOR MACHINES
    'KNN': KNeighborsClassifier(), ## K NEAREST NEIGHBOURS
    'Naive Bayes': GaussianNB() ## PROBABILITY BASED APPROACH
}

# Dictionaries to store results
results_train = {}
results_test = {}

# K-Fold cross-validation
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

for name, clf in classifiers.items():
    # Cross-validation
    cv_results = cross_val_score(clf, X_train, y_train, cv=kfold, scoring='accuracy
    results_train[name] = {
        'CrossVal_Score_Mean': cv_results.mean(),
        'CrossVal_Error': cv_results.std()
    }

    # Train the model
    clf.fit(X_train, y_train)

    # Make predictions on the test set
    y_pred = clf.predict(X_test)
    y_pred_proba = clf.predict_proba(X_test)[:, 1] if hasattr(clf, "predict_proba")

    # Evaluate the predictions
    accuracy = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    roc_auc = roc_auc_score(y_test, y_pred_proba) if y_pred_proba is not None else
    clf_report = classification_report(y_test, y_pred)
    cm = confusion_matrix(y_test, y_pred)

    # Store the evaluation metrics
    results_test[name] = {
        'Accuracy': accuracy,
        'F1_Score': f1,
        'ROC_AUC_Score': roc_auc,
        'Classification_Report': clf_report,
        'Confusion_Matrix': cm
    }

# Print the cross-validation results
for name, result in results_train.items():
    print(f"{name} (Training):")
    print(f" CrossVal_Score_Mean: {result['CrossVal_Score_Mean']:.4f}")
    print(f" CrossVal_Error: {result['CrossVal_Error']:.4f}")
    print()

# Print the test results
for name, result in results_test.items():
    print(f"{name} (Test):")
    print(f" Accuracy: {result['Accuracy']:.4f}")
    print(f" F1_Score: {result['F1_Score']:.4f}")
    print(f" ROC_AUC_Score: {result['ROC_AUC_Score']}")
    print(f" Classification_Report:\n{result['Classification_Report']}")
    print(f" Confusion_Matrix:\n{result['Confusion_Matrix']}\n")


Logistic Regression (Training):


CrossVal_Score_Mean: 0.9324
CrossVal_Error: 0.0082

Random Forest (Training):


CrossVal_Score_Mean: 0.9306
CrossVal_Error: 0.0082

AdaBoost (Training):
CrossVal_Score_Mean: 0.9251
CrossVal_Error: 0.0086

Bagging (Training):
CrossVal_Score_Mean: 0.9269
CrossVal_Error: 0.0084

Extra Trees (Training):


CrossVal_Score_Mean: 0.9319
CrossVal_Error: 0.0105

Gradient Boosting (Training):


CrossVal_Score_Mean: 0.9292
CrossVal_Error: 0.0088

XGBoost (Training):
CrossVal_Score_Mean: 0.9283
CrossVal_Error: 0.0068

Decision Tree (Training):


CrossVal_Score_Mean: 0.9165
CrossVal_Error: 0.0093

SVM (Training):
CrossVal_Score_Mean: 0.9220
CrossVal_Error: 0.0094

KNN (Training):
CrossVal_Score_Mean: 0.9192
CrossVal_Error: 0.0090

Naive Bayes (Training):


CrossVal_Score_Mean: 0.5640
CrossVal_Error: 0.0151

Logistic Regression (Test):


Accuracy: 0.9376
F1_Score: 0.9667
ROC_AUC_Score: 0.9202303847632568
Classification_Report:
precision recall f1-score support

0 0.74 0.37 0.50 78


1 0.95 0.99 0.97 867

accuracy 0.94 945


macro avg 0.84 0.68 0.73 945


weighted avg 0.93 0.94 0.93 945

Confusion_Matrix:
[[ 29 49]
[ 10 857]]

Random Forest (Test):


Accuracy: 0.9418
F1_Score: 0.9690
ROC_AUC_Score: 0.9189882589536568
Classification_Report:
precision recall f1-score support

0 0.83 0.37 0.51 78


1 0.95 0.99 0.97 867

accuracy 0.94 945


macro avg 0.89 0.68 0.74 945
weighted avg 0.94 0.94 0.93 945

Confusion_Matrix:
[[ 29 49]
[ 6 861]]

AdaBoost (Test):
Accuracy: 0.9270
F1_Score: 0.9610
ROC_AUC_Score: 0.8801200721615946
Classification_Report:
precision recall f1-score support

0 0.61 0.32 0.42 78


1 0.94 0.98 0.96 867

accuracy 0.93 945


macro avg 0.78 0.65 0.69 945
weighted avg 0.91 0.93 0.92 945

Confusion_Matrix:
[[ 25 53]
[ 16 851]]

Bagging (Test):
Accuracy: 0.9280
F1_Score: 0.9607
ROC_AUC_Score: 0.8835950669860704
Classification_Report:
precision recall f1-score support

0 0.56 0.58 0.57 78


1 0.96 0.96 0.96 867

accuracy 0.93 945


macro avg 0.76 0.77 0.77 945
weighted avg 0.93 0.93 0.93 945


Confusion_Matrix:
[[ 45 33]
[ 35 832]]

Extra Trees (Test):


Accuracy: 0.9386
F1_Score: 0.9670
ROC_AUC_Score: 0.9243412297045516
Classification_Report:
precision recall f1-score support

0 0.69 0.46 0.55 78


1 0.95 0.98 0.97 867

accuracy 0.94 945


macro avg 0.82 0.72 0.76 945
weighted avg 0.93 0.94 0.93 945

Confusion_Matrix:
[[ 36 42]
[ 16 851]]

Gradient Boosting (Test):


Accuracy: 0.9323
F1_Score: 0.9640
ROC_AUC_Score: 0.8767559814272617
Classification_Report:
precision recall f1-score support

0 0.72 0.29 0.42 78


1 0.94 0.99 0.96 867

accuracy 0.93 945


macro avg 0.83 0.64 0.69 945
weighted avg 0.92 0.93 0.92 945

Confusion_Matrix:
[[ 23 55]
[ 9 858]]

XGBoost (Test):
Accuracy: 0.9418
F1_Score: 0.9690
ROC_AUC_Score: 0.9110549197054386
Classification_Report:
precision recall f1-score support

0 0.81 0.38 0.52 78


1 0.95 0.99 0.97 867

accuracy 0.94 945


macro avg 0.88 0.69 0.75 945
weighted avg 0.94 0.94 0.93 945

Confusion_Matrix:
[[ 30 48]


[ 7 860]]

Decision Tree (Test):


Accuracy: 0.9217
F1_Score: 0.9572
ROC_AUC_Score: 0.8102209209475645
Classification_Report:
precision recall f1-score support

0 0.52 0.56 0.54 78


1 0.96 0.95 0.96 867

accuracy 0.92 945


macro avg 0.74 0.76 0.75 945
weighted avg 0.92 0.92 0.92 945

Confusion_Matrix:
[[ 44 34]
[ 40 827]]

SVM (Test):
Accuracy: 0.9270
F1_Score: 0.9617
ROC_AUC_Score: 0.8877724543814509
Classification_Report:
precision recall f1-score support

0 1.00 0.12 0.21 78


1 0.93 1.00 0.96 867

accuracy 0.93 945


macro avg 0.96 0.56 0.58 945
weighted avg 0.93 0.93 0.90 945

Confusion_Matrix:
[[ 9 69]
[ 0 867]]

KNN (Test):
Accuracy: 0.9132
F1_Score: 0.9546
ROC_AUC_Score: 0.7544287699997042
Classification_Report:
precision recall f1-score support

0 0.00 0.00 0.00 78


1 0.92 1.00 0.95 867

accuracy 0.91 945


macro avg 0.46 0.50 0.48 945
weighted avg 0.84 0.91 0.88 945

Confusion_Matrix:
[[ 0 78]
[ 4 863]]


Naive Bayes (Test):


Accuracy: 0.5799
F1_Score: 0.7146
ROC_AUC_Score: 0.6135436074882441
Classification_Report:
precision recall f1-score support

0 0.12 0.65 0.20 78


1 0.95 0.57 0.71 867

accuracy 0.58 945


macro avg 0.53 0.61 0.46 945
weighted avg 0.88 0.58 0.67 945

Confusion_Matrix:
[[ 51 27]
[370 497]]

Hyper Parameter Tuning


In [52]: # Define the parameter grid for GridSearchCV
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_features': ['auto', 'sqrt', 'log2'],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False]
}

### 3*3*4*3*3*2*5 = 3240 fits

# Initialize the Random Forest model
rf = RandomForestClassifier(random_state=42)

# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid,
                           cv=5, n_jobs=-1, scoring='accuracy', verbose=2)

# Fit GridSearchCV
grid_search.fit(X_train, y_train)

# Get the best parameters
best_params = grid_search.best_params_
print(f"Best Parameters: {best_params}")

Fitting 5 folds for each of 648 candidates, totalling 3240 fits
Best Parameters: {'bootstrap': True, 'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 100}
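
The candidate and fit counts reported by GridSearchCV can be reproduced by enumerating the grid. This sketch copies the grid from the cell above and only counts combinations; it does not train anything.

```python
from itertools import product

# Grid copied from the tuning cell; only the number of combinations matters here.
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_features': ['auto', 'sqrt', 'log2'],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False],
}
cv_folds = 5

candidates = list(product(*param_grid.values()))
total_fits = len(candidates) * cv_folds
print(len(candidates), total_fits)
```

Every extra value in any list multiplies the total, which is why an exhaustive grid grows expensive quickly.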

Training the model with best hyper parameter


In [53]: # Train the Random Forest model with the best parameters
best_rf = RandomForestClassifier(**best_params, random_state=42)
best_rf.fit(X_train, y_train)

# Make predictions on the training set


y_train_pred = best_rf.predict(X_train)
y_train_pred_proba = best_rf.predict_proba(X_train)[:, 1]

# Make predictions on the test set


y_test_pred = best_rf.predict(X_test)
y_test_pred_proba = best_rf.predict_proba(X_test)[:, 1]

# Evaluate the model on the training set


train_accuracy = accuracy_score(y_train, y_train_pred)
train_f1 = f1_score(y_train, y_train_pred)
train_roc_auc = roc_auc_score(y_train, y_train_pred_proba)
train_clf_report = classification_report(y_train, y_train_pred)
train_cm = confusion_matrix(y_train, y_train_pred)

# Evaluate the model on the test set


test_accuracy = accuracy_score(y_test, y_test_pred)
test_f1 = f1_score(y_test, y_test_pred)
test_roc_auc = roc_auc_score(y_test, y_test_pred_proba)
test_clf_report = classification_report(y_test, y_test_pred)
test_cm = confusion_matrix(y_test, y_test_pred)

# Print the evaluation metrics


print("Training Metrics:")
print(f" Accuracy: {train_accuracy:.4f}")
print(f" F1_Score: {train_f1:.4f}")
print(f" ROC_AUC_Score: {train_roc_auc:.4f}")
print(f" Classification_Report:\n{train_clf_report}")
print(f" Confusion_Matrix:\n{train_cm}\n")

print("Test Metrics:")
print(f" Accuracy: {test_accuracy:.4f}")
print(f" F1_Score: {test_f1:.4f}")
print(f" ROC_AUC_Score: {test_roc_auc:.4f}")
print(f" Classification_Report:\n{test_clf_report}")
print(f" Confusion_Matrix:\n{test_cm}\n")


Training Metrics:
Accuracy: 0.9946
F1_Score: 0.9970
ROC_AUC_Score: 0.9985
Classification_Report:
precision recall f1-score support

0 1.00 0.93 0.97 178


1 0.99 1.00 1.00 2026

accuracy 0.99 2204


macro avg 1.00 0.97 0.98 2204
weighted avg 0.99 0.99 0.99 2204

Confusion_Matrix:
[[ 166 12]
[ 0 2026]]

Test Metrics:
Accuracy: 0.9429
F1_Score: 0.9696
ROC_AUC_Score: 0.9124
Classification_Report:
precision recall f1-score support

0 0.83 0.38 0.53 78


1 0.95 0.99 0.97 867

accuracy 0.94 945


macro avg 0.89 0.69 0.75 945
weighted avg 0.94 0.94 0.93 945

Confusion_Matrix:
[[ 30 48]
[ 6 861]]

Confusion Matrix
In [54]: # Print the Confusion Matrix and slice it into four pieces

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_test_pred)

print('Confusion matrix\n\n', cm)

Confusion matrix

[[ 30 48]
[ 6 861]]

In [55]: print("True Positive : ", cm[1, 1])
print("True Negative : ", cm[0, 0])
print("False Positive: ", cm[0, 1])
print("False Negative: ", cm[1, 0])

True Positive : 861


True Negative : 30
False Positive: 48
False Negative: 6
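
The headline metrics in the classification report can be recomputed from these four counts. The sketch below uses the standard definitions, with class 1 (positive review) as the positive class; the variable names are ours.

```python
# Counts taken from the confusion matrix above (class 1 = positive review).
tp, tn, fp, fn = 861, 30, 48, 6

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_pos = tp / (tp + fp)
recall_pos = tp / (tp + fn)
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)
recall_neg = tn / (tn + fp)      # share of negative reviews that were caught
precision_neg = tn / (tn + fn)   # share of predicted negatives that were right

print(f"accuracy={accuracy:.4f} f1_pos={f1_pos:.4f} "
      f"precision_neg={precision_neg:.2f} recall_neg={recall_neg:.2f}")
```

The low recall on class 0 shows that most negative reviews are still classified as positive, despite the high overall accuracy.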

In [56]: # visualize confusion matrix with seaborn heatmap
# confusion_matrix returns rows = actual class and columns = predicted class, in label order [0, 1]
cm_matrix = pd.DataFrame(data=cm, columns=['Predicted Negative:0', 'Predicted Positive:1'],
                         index=['Actual Negative:0', 'Actual Positive:1'])

sns.heatmap(cm_matrix, annot=True, fmt='d', cmap='YlGnBu')

Out[56]: <Axes: >

Classification Report
In [57]: from sklearn.metrics import classification_report

print(classification_report(y_test, y_test_pred))


precision recall f1-score support

0 0.83 0.38 0.53 78


1 0.95 0.99 0.97 867

accuracy 0.94 945


macro avg 0.89 0.69 0.75 945
weighted avg 0.94 0.94 0.93 945

Result:
• The tuned Random Forest model reaches about 94% accuracy on the test set (0.9429),
although it recalls only 38% of the negative reviews.

Thank You..!
