Twitter Sentiment Analysis
The goal is to build a sentiment analysis system for Twitter data (here, tweets addressed to
airlines) that classifies each tweet as positive, negative, or neutral. We implement a Naive Bayes
classifier and compare it with several other classification algorithms to see which gives the most
accurate predictions. Reliable sentiment labels let stakeholders track how users feel about a
brand, engage with them effectively, and follow market trends.
Basic checks
# Importing Essential Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
# Preview the first two rows
train_df.head(2)
train_df['airline_sentiment'].value_counts()
negative    6851
neutral     2327
positive    1802
Name: airline_sentiment, dtype: int64
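The counts above show a clear class imbalance, which matters when reading the accuracy figures later on. As a minimal sketch (the dictionary simply copies the three counts printed above; `class_shares` and `majority_baseline` are names introduced here for illustration), the share of each class and the accuracy of always predicting the most frequent class can be worked out as follows:

```python
# Class counts copied from the value_counts() output above
class_counts = {"negative": 6851, "neutral": 2327, "positive": 1802}

total = sum(class_counts.values())
class_shares = {label: count / total for label, count in class_counts.items()}

for label, share in class_shares.items():
    print(f"{label:<9} {share:.1%}")

# Accuracy of a trivial model that always predicts the most frequent class
# (computed on the full dataset, not on the test split)
majority_baseline = max(class_shares.values())
print(f"majority-class baseline: {majority_baseline:.3f}")
```

Any classifier trained on this data should be judged against that baseline rather than against zero.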
# Add a 'Length' column holding the number of characters in each tweet
train_df['Length'] = train_df['text'].apply(len)
                                                text airline_sentiment  Length
0  @SouthwestAir I am scheduled for the morning, ...          negative     141
1  @SouthwestAir seeing your workers time in and ...          positive     124
2  @united Flew ORD to Miami and back and had gr...           positive      84
3     @SouthwestAir @dultch97 that's horse radish 😤🐴          negative      46
4  @united so our flight into ORD was delayed bec...          negative     139
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10980 entries, 0 to 10979
Data columns (total 3 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   text               10980 non-null  object
 1   airline_sentiment  10980 non-null  object
 2   Length             10980 non-null  int64
dtypes: int64(1), object(2)
memory usage: 257.5+ KB
                  text airline_sentiment
count            10980             10980
unique           10851                 3
top     @united thanks          negative
freq                 6              6851
sns.histplot(train_df.Length,kde=True,color='c')
import string
string.punctuation
import nltk
nltk.download('stopwords')
True
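# Load the English stop word list; import the corpus under an alias so the
# 'stopwords' variable does not overwrite the module and the cell can be re-run
from nltk.corpus import stopwords as nltk_stopwords
stopwords = nltk_stopwords.words('english')
np.array(stopwords)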
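The `remove_stopwords` helper applied in the next cell is not defined anywhere in this notebook. The following is a hypothetical sketch of what it might look like, not the original definition. It assumes a whitespace split and a case-sensitive lookup, which would be consistent with the cleaned text shown further down (the capital "I" survives). The explicit `stop_words` parameter and the tiny `demo_stop_words` set are assumptions made so the example runs on its own; in the notebook the function presumably reads the NLTK list directly.

```python
def remove_stopwords(text, stop_words):
    """Drop whitespace-separated tokens that appear in stop_words (case-sensitive)."""
    return " ".join(word for word in text.split() if word not in stop_words)


# Tiny stand-in for the NLTK English stop word list (assumption for this demo)
demo_stop_words = {"i", "am", "for", "the"}

cleaned = remove_stopwords("@SouthwestAir I am scheduled for the morning", demo_stop_words)
print(cleaned)  # -> @SouthwestAir I scheduled morning
```

Lowercasing each token before the lookup would also remove capitalised stop words such as "I", at the cost of changing the cleaned text shown in this notebook.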
train_df['text_clean'] = train_df['text'].apply(remove_stopwords)
train_df.head()
def remove_punc(text):
    # Keep only the characters that are not ASCII punctuation
    text = "".join([char for char in text if char not in string.punctuation])
    return text
train_df['text_clean'] = train_df['text_clean'].apply(remove_punc)
train_df.head()
                                                text  airline_sentiment  Length                                          text_clean
0  @SouthwestAir I am scheduled for the morning, ...                  0     141  SouthwestAir I scheduled morning 2 days fact y...
1  @SouthwestAir seeing your workers time in and ...                  2     124  SouthwestAir seeing workers time time going be...
2  @united Flew ORD to Miami and back and had gr...                   2      84  united Flew ORD Miami back great crew service...
3     @SouthwestAir @dultch97 that's horse radish 😤🐴                  0      46          SouthwestAir dultch97 thats horse radish 😤🐴
4  @united so our flight into ORD was delayed bec...                  0     139  united flight ORD delayed Air Force One last f...
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)
(8784,)
(8784,)
(2196,)
(2196,)
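The cells that encode the labels and create `x_train`, `x_test`, `y_train` and `y_test` are missing from this export. The shapes above (8784 and 2196 out of 10980 rows) match an 80/20 split, and the table shows `negative` as 0 and `positive` as 2. Below is a sketch of how such a split could be produced. The synthetic `demo_df`, the value 1 for `neutral`, `random_state=42` and the `stratify` argument are all assumptions, not taken from the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as train_df (10980)
n_rows = 10980
demo_df = pd.DataFrame({
    "text_clean": [f"tweet {i}" for i in range(n_rows)],
    "airline_sentiment": ["negative", "neutral", "positive"] * (n_rows // 3),
})

# Integer encoding: 0 and 2 follow the table above, 1 for neutral is assumed
label_map = {"negative": 0, "neutral": 1, "positive": 2}
demo_df["airline_sentiment"] = demo_df["airline_sentiment"].map(label_map)

# 80/20 split; stratify keeps the class proportions equal in both parts (assumption)
x_train, x_test, y_train, y_test = train_test_split(
    demo_df["text_clean"],
    demo_df["airline_sentiment"],
    test_size=0.2,
    random_state=42,
    stratify=demo_df["airline_sentiment"],
)

print(x_train.shape, y_train.shape)  # -> (8784,) (8784,)
print(x_test.shape, y_test.shape)    # -> (2196,) (2196,)
```

With imbalanced classes like the ones in this dataset, a stratified split avoids a test set whose class mix differs from the training set by chance.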
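# Bag-of-words and TF-IDF vectorizers (TV is created here but not used below)
CV = CountVectorizer(stop_words='english')
TV = TfidfVectorizer(stop_words='english')
# Learn the vocabulary on the training split only, then reuse it for the test split
x_train = CV.fit_transform(x_train)
x_train
x_test = CV.transform(x_test)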
from sklearn.naive_bayes import MultinomialNB
model_nb = MultinomialNB()
model_nb.fit(x_train, y_train)
MultinomialNB()
y_nb_pred = model_nb.predict(x_test)
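The notebook creates a `TfidfVectorizer` (`TV`) but trains only on raw counts. As a hedged sketch of how the TF-IDF features could be tried with the same classifier, the vectorizer and model can be chained in a scikit-learn pipeline. The toy sentences, their labels and the name `tfidf_nb` are made up for illustration, so the result says nothing about performance on the real tweets.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data: 0 = negative, 2 = positive (same encoding as the table above)
train_texts = [
    "great crew and service",
    "flight delayed again",
    "lost my bag terrible service",
    "thanks for the great flight",
]
train_labels = [2, 0, 0, 2]
test_texts = ["great service", "delayed and terrible"]

# The pipeline fits the vectorizer on the training texts only and reuses
# the learned vocabulary and IDF weights when predicting
tfidf_nb = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
tfidf_nb.fit(train_texts, train_labels)

tfidf_predictions = tfidf_nb.predict(test_texts)
print(tfidf_predictions)
```

Wrapping both steps in one pipeline also removes the need to overwrite `x_train` and `x_test` with their vectorized versions.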
Model Evaluation
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             classification_report, roc_auc_score, confusion_matrix)
naive_bayes_accuracy=accuracy_score(y_test,y_nb_pred)
naive_bayes_accuracy
0.755464480874317
print(classification_report(y_test,y_nb_pred))
confusion_matrix(y_test,y_nb_pred)
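The outputs of the classification report and confusion matrix did not survive in this export. As a small illustration of how to read a confusion matrix for the three classes, the sketch below uses made-up labels (`y_true` and `y_hat` are hypothetical, not the notebook's predictions). It derives per-class recall and overall accuracy from the matrix.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels: 0 = negative, 1 = neutral, 2 = positive
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2])
y_hat = np.array([0, 0, 0, 1, 1, 0, 2, 0])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_hat, labels=[0, 1, 2])
print(cm)

# Recall per class: correct predictions (diagonal) divided by the row totals
per_class_recall = cm.diagonal() / cm.sum(axis=1)
print(per_class_recall)  # -> [0.75 0.5  0.5 ]

# Overall accuracy: sum of the diagonal divided by the number of samples
accuracy = cm.diagonal().sum() / cm.sum()
print(accuracy)  # -> 0.625
```

In this toy case the accuracy is pulled up by the majority class, while recall on the two smaller classes is only 0.5, which is the kind of gap that a single accuracy number hides.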
# Convert the sparse count matrices to dense NumPy arrays
x_train = x_train.toarray()
x_test = x_test.toarray()
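from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(multi_class='ovr', n_jobs=-1)
lr.fit(x_train, y_train)
LogisticRegression(multi_class='ovr', n_jobs=-1)
y_lr_pred = lr.predict(x_test)
logistic_accuracy = accuracy_score(y_test, y_lr_pred)
logistic_accuracy
0.7891621129326047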
print(classification_report(y_test,y_lr_pred))
confusion_matrix(y_test,y_lr_pred)
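The `multi_class='ovr'` argument shown in the model output is deprecated in recent scikit-learn releases (1.5 and later), where wrapping the estimator in `OneVsRestClassifier` is the suggested replacement. The sketch below shows that alternative on made-up data. The sentences, the value 1 for neutral, `max_iter=1000` and the name `ovr_lr` are assumptions. It also keeps the features sparse, since `LogisticRegression` accepts SciPy sparse matrices and the dense conversion may not be needed for this model.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy stand-in data; 0 = negative, 1 = neutral (assumed), 2 = positive
texts = [
    "flight delayed again terrible",
    "lost my bag awful service",
    "what time does boarding start",
    "is there wifi on this flight",
    "great crew thanks",
    "lovely flight great service",
]
labels = [0, 0, 1, 1, 2, 2]

# fit_transform returns a SciPy sparse matrix; no dense conversion is done here
features = CountVectorizer().fit_transform(texts)

# One binary logistic regression per class, combined one-vs-rest
ovr_lr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr_lr.fit(features, labels)
predictions = ovr_lr.predict(features)

print(len(ovr_lr.estimators_))  # one fitted binary model per class
```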
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier()
dtc.fit(x_train, y_train)
DecisionTreeClassifier()
y_pred_dtc = dtc.predict(x_test)
Model Evaluation Using Decision Tree Classifier
decisiontree_accuracy=accuracy_score(y_test,y_pred_dtc)
decisiontree_accuracy
0.651183970856102
print(classification_report(y_test,y_pred_dtc))
confusion_matrix(y_test,y_pred_dtc)
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
rfc.fit(x_train, y_train)
RandomForestClassifier()
y_pred_rfc = rfc.predict(x_test)
Model Evaluation of Random Forest Classifier
rf_accuracy=accuracy_score(y_test,y_pred_rfc)
rf_accuracy
0.7418032786885246
print(classification_report(y_test,y_pred_rfc))
confusion_matrix(y_test,y_pred_rfc)
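from xgboost import XGBRFClassifier
# XGBRFClassifier is XGBoost's random forest variant, not a gradient-boosted model
xgb = XGBRFClassifier(n_jobs=-1)
xgb.fit(x_train, y_train)
y_xgb_pred = xgb.predict(x_test)
xgb_accuracy = accuracy_score(y_test, y_xgb_pred)
xgb_accuracy
0.703551912568306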
print(classification_report(y_test,y_xgb_pred))
confusion_matrix(y_test,y_xgb_pred)
y=[naive_bayes_accuracy,logistic_accuracy,decisiontree_accuracy,rf_accuracy,xgb_accuracy])
<Axes: >
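Conclusion
Logistic regression achieved the highest test accuracy (0.789), followed by Naive Bayes (0.755),
Random Forest (0.742), the XGBoost random forest classifier (0.704), and the Decision Tree (0.651).
Because 6851 of the 10980 tweets (about 62%) are negative, these accuracies should be read against
that majority-class share, and the per-class precision and recall in the classification reports
are the better guide to performance on neutral and positive tweets. All models were run with
default or near-default settings, so there is still room for improvement, particularly for the
Decision Tree and XGBoost models.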
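The ranking can be checked directly from the accuracies printed in the evaluation cells. This is a minimal sketch that copies those five values by hand. The dictionary keys are display labels chosen here, not names from the notebook.

```python
# Test accuracies as printed in the evaluation cells above
accuracies = {
    "Naive Bayes": 0.755464480874317,
    "Logistic Regression": 0.7891621129326047,
    "Decision Tree": 0.651183970856102,
    "Random Forest": 0.7418032786885246,
    "XGBRFClassifier": 0.703551912568306,
}

# Sort models from most to least accurate
ranking = sorted(accuracies.items(), key=lambda item: item[1], reverse=True)
for name, acc in ranking:
    print(f"{name:<20} {acc:.3f}")

# Gap between the best model and the runner-up
best_name, best_acc = ranking[0]
runner_up_name, runner_up_acc = ranking[1]
gap = best_acc - runner_up_acc
print(f"{best_name} leads {runner_up_name} by {gap:.3f}")
```

The lead of roughly three percentage points comes from a single train/test split, so cross-validation would be needed before treating the ordering of the top models as settled.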