Twitter Sentiment Analysis

The document discusses developing a sentiment analysis system for classifying tweets into positive, negative or neutral categories using naive Bayesian classifiers and other algorithms. It aims to accurately decipher the underlying sentiments expressed in tweets to provide insights into prevailing sentiments within Twitter.

Problem Statement

Our task involves developing a robust sentiment analysis system for Twitter data, where we aim to
accurately classify tweets into positive, negative, or neutral categories. We are challenged with
implementing naive Bayesian classifiers along with additional classification algorithms to enhance
the precision and efficacy of our sentiment analysis model. By addressing this complex problem, we
endeavor to decipher the emotional nuances and underlying sentiments expressed in tweets, thereby
enabling comprehensive insights into the prevailing sentiments within the Twitter community. This
endeavor will empower stakeholders to make informed decisions, engage effectively with users, and
gain valuable insights into market trends and brand perceptions.

Basic checks
# Importing Essential Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Load the training data
train_df = pd.read_csv('twitter_x_y_train.csv')

# Show the first two rows
train_df.head(2)

[Output: interactive DataFrame summary of train_df, 10980 rows. Columns listed:
tweet_id, airline_sentiment (3 values: negative, positive, neutral), airline (6 values),
airline_sentiment_gold, name, negativereason_gold, retweet_count (0 to 44),
text (10851 unique values), tweet_coord, tweet_created (2015-02-16 to 2015-02-24),
tweet_location, user_timezone.]

# Count the tweets in each sentiment class
train_df.airline_sentiment.value_counts()

negative 6851
neutral 2327
positive 1802
Name: airline_sentiment, dtype: int64

# Add a Length column holding the character count of each tweet
train_df['Length'] = train_df['text'].apply(len)

# Keep only the columns needed for modelling and drop the rest
# (.copy() avoids a pandas SettingWithCopyWarning when columns are added later)
train_df = train_df[['text', 'airline_sentiment', 'Length']].copy()

# Show the first 5 rows of the data
train_df.head()

                                                text airline_sentiment  Length
0  @SouthwestAir I am scheduled for the morning, ...          negative     141
1  @SouthwestAir seeing your workers time in and ...          positive     124
2  @united Flew ORD to Miami and back and had gr...          positive      84
3   @SouthwestAir @dultch97 that's horse radish 😤🐴          negative      46
4  @united so our flight into ORD was delayed bec...          negative     139

# For data overview


train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10980 entries, 0 to 10979
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 text 10980 non-null object
1 airline_sentiment 10980 non-null object
2 Length 10980 non-null int64
dtypes: int64(1), object(2)
memory usage: 257.5+ KB

# Generate descriptive statistics for the numerical columns of train_df
train_df.describe()

             Length
count  10980.000000
mean     103.663297
std       36.513462
min       12.000000
25%       76.000000
50%      114.000000
75%      136.000000
max      186.000000

# Generate descriptive statistics for the object-type (string) columns
train_df.describe(include='O')

                  text airline_sentiment
count            10980             10980
unique           10851                 3
top     @united thanks          negative
freq                 6              6851

Exploratory Data Analysis (EDA)


Exploratory Data Analysis (EDA) is the process of examining a data set to understand its main
characteristics, discover patterns, spot outliers, and identify relationships between
variables.

sns.histplot(train_df.Length,kde=True,color='c')

<Axes: xlabel='Length', ylabel='Count'>


The histogram shows that the most common tweet lengths fall between 125 and 150 characters.

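# @title airline_sentiment vs airline
# Note: this cell needs the 'airline' column, so it must run before
# train_df is reduced to text, airline_sentiment and Length above.
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd

plt.subplots(figsize=(8, 8))
# Count tweets per airline within each sentiment class
df_2dhist = pd.DataFrame({
    x_label: grp['airline'].value_counts()
    for x_label, grp in train_df.groupby('airline_sentiment')
})
sns.heatmap(df_2dhist, cmap='viridis')
plt.xlabel('airline_sentiment')
_ = plt.ylabel('airline')

# Share of each sentiment class; value_counts() sorts by frequency,
# which here gives the order negative, neutral, positive
plt.pie(train_df.airline_sentiment.value_counts(),
        labels=['Negative', 'Neutral', 'Positive'], autopct='%.2f')
plt.show()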
Data Preprocessing
Categorical Encoding
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()

# Convert the sentiment labels to integers
# (LabelEncoder sorts the class names: negative=0, neutral=1, positive=2)
train_df.airline_sentiment = le.fit_transform(train_df.airline_sentiment)
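
To read the numeric labels in the reports further down, it helps to recall that LabelEncoder assigns integers to the class names in sorted order. The following is a minimal pure-Python sketch of that mapping; the helper name `encode_labels` is hypothetical and not part of scikit-learn.

```python
def encode_labels(labels):
    """Map each distinct label to its index in sorted order, mimicking LabelEncoder."""
    classes = sorted(set(labels))
    index = {name: i for i, name in enumerate(classes)}
    return [index[name] for name in labels], classes

codes, classes = encode_labels(["negative", "positive", "neutral", "negative"])
print(classes)  # ['negative', 'neutral', 'positive']
print(codes)    # [0, 2, 1, 0]
```

So class 0 is negative, class 1 is neutral and class 2 is positive.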

import string
string.punctuation


import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...


[nltk_data] Package stopwords is already up-to-date!

True
# Load the English stopword list
# (note: this rebinds the name stopwords from the NLTK module to a plain list)
from nltk.corpus import stopwords
stopwords=stopwords.words('english')

np.array(stopwords)

array(['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
       "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself',
       'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her',
       'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them',
       'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom',
       'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are',
       'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
       'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and',
       'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at',
       'by', 'for', 'with', 'about', 'against', 'between', 'into',
       'through', 'during', 'before', 'after', 'above', 'below', 'to',
       'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under',
       'again', 'further', 'then', 'once', 'here', 'there', 'when',
       'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',
       'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own',
       'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will',
       'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll',
       'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn',
       "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't",
       'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma',
       'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't",
       'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't",
       'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"],
      dtype='<U10')

# Remove stopwords for a cleaner analysis
# (the comparison is case-sensitive, so capitalised words such as "I" are kept)
def remove_stopwords(text):
    words = text.split(' ')
    return " ".join([word for word in words if word not in stopwords])

train_df['text_clean'] = train_df['text'].apply(remove_stopwords)

train_df.head()

# Strip punctuation characters from the stopword-filtered text
def remove_punc(text):
    return "".join([char for char in text if char not in string.punctuation])

train_df['text_clean'] = train_df['text_clean'].apply(remove_punc)

train_df.head()

                                                text  airline_sentiment  \
0  @SouthwestAir I am scheduled for the morning, ...                  0
1  @SouthwestAir seeing your workers time in and ...                  2
2  @united Flew ORD to Miami and back and had gr...                  2
3   @SouthwestAir @dultch97 that's horse radish 😤🐴                  0
4  @united so our flight into ORD was delayed bec...                  0

   Length                                         text_clean
0     141  SouthwestAir I scheduled morning 2 days fact y...
1     124  SouthwestAir seeing workers time time going be...
2      84  united Flew ORD Miami back great crew service...
3      46       SouthwestAir dultch97 thats horse radish 😤🐴
4     139  united flight ORD delayed Air Force One last f...
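
The cleaned rows above still contain "I" because the NLTK list is lowercase while the filter compares tokens case-sensitively. One possible variant, sketched below, lowercases each token only for the lookup. The small stopword set and the name `remove_stopwords_ci` are hypothetical stand-ins for illustration, not the NLTK list or the notebook's function.

```python
# Hypothetical stand-in for the NLTK English stopword list (assumption for illustration).
SAMPLE_STOPWORDS = {"i", "am", "for", "the", "and", "to"}

def remove_stopwords_ci(text, stopword_set=SAMPLE_STOPWORDS):
    """Drop stopwords, comparing case-insensitively but keeping the original casing."""
    kept = [tok for tok in text.split() if tok.lower() not in stopword_set]
    return " ".join(kept)

print(remove_stopwords_ci("@SouthwestAir I am scheduled for the morning"))
# @SouthwestAir scheduled morning
```

Using `split()` without an argument also collapses repeated spaces, which `split(' ')` would turn into empty tokens.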

# Show the processed text column
train_df.text_clean

0        SouthwestAir I scheduled morning 2 days fact y...
1        SouthwestAir seeing workers time time going be...
2        united Flew ORD Miami back great crew service...
3              SouthwestAir dultch97 thats horse radish 😤🐴
4        united flight ORD delayed Air Force One last f...
                               ...
10975                              AmericanAir followback
10976    united thanks help Wish phone reps could accom...
10977         usairways the Worst Ever dca customerservice
10978    nrhodes85 look Another apology DO NOT FLY USAi...
10979    united far worst airline 4 plane delays 1 roun...
Name: text_clean, Length: 10980, dtype: object
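
# Features and target. Note that the raw text column is used here, not text_clean;
# the vectorizer below lowercases, tokenizes and drops English stopwords itself.
x=train_df.text
y=train_df.airline_sentiment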

from sklearn.model_selection import train_test_split

# Hold out 20% of the tweets for testing
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=40)

print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

(8784,)
(8784,)
(2196,)
(2196,)

CV=CountVectorizer(stop_words='english')

TV=TfidfVectorizer(stop_words='english')

x_train=CV.fit_transform(x_train)

x_train

<8784x10856 sparse matrix of type '<class 'numpy.int64'>'
        with 80223 stored elements in Compressed Sparse Row format>

x_test=CV.transform(x_test)
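
As a rough illustration of what the count vectorizer produces, the sketch below builds a vocabulary from training texts and turns each text into a row of token counts. It is a simplified pure-Python stand-in (whitespace tokenization, no stopword removal) with hypothetical helper names, not scikit-learn's implementation. As in the code above, the vocabulary is learned from the training texts only, and words that appear only in the test texts are ignored.

```python
from collections import Counter

def fit_vocabulary(texts):
    """Assign a column index to every distinct lowercase token, in sorted order."""
    tokens = sorted({tok for text in texts for tok in text.lower().split()})
    return {tok: col for col, tok in enumerate(tokens)}

def transform(texts, vocabulary):
    """Return one count row per text; tokens outside the vocabulary are skipped."""
    rows = []
    for text in texts:
        counts = Counter(tok for tok in text.lower().split() if tok in vocabulary)
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows

train_texts = ["great crew great service", "flight delayed again"]
vocab = fit_vocabulary(train_texts)
print(vocab)
# {'again': 0, 'crew': 1, 'delayed': 2, 'flight': 3, 'great': 4, 'service': 5}
print(transform(["great flight cancelled"], vocab))
# [[0, 0, 0, 1, 1, 0]]
```

The real matrix is stored in sparse form because almost every entry is zero: 80223 stored counts out of 8784 x 10856 cells.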

Model Creation with Naive Bayes


Naive Bayes classifiers are a collection of classification algorithms based on Bayes' Theorem. It
is not a single algorithm but a family of algorithms that share a common principle: every
feature is assumed to be conditionally independent of every other feature, given the class.
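
To make the independence assumption concrete, here is a minimal pure-Python sketch of a multinomial Naive Bayes classifier with Laplace smoothing. The toy tweets, the function names `train_nb` and `predict_nb`, and the smoothing value are assumptions for illustration; this is not the scikit-learn implementation used below. Because features are treated as independent given the class, the score of a tweet is just the log prior plus the sum of per-word log likelihoods.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed word log likelihoods per class."""
    vocab = {tok for doc in docs for tok in doc.split()}
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    model = {}
    for label, n_docs in class_counts.items():
        total = sum(word_counts[label].values())
        log_prior = math.log(n_docs / len(docs))
        log_like = {
            tok: math.log((word_counts[label][tok] + alpha) / (total + alpha * len(vocab)))
            for tok in vocab
        }
        model[label] = (log_prior, log_like)
    return model

def predict_nb(model, doc):
    """Pick the class with the highest log posterior; unknown tokens are ignored."""
    scores = {}
    for label, (log_prior, log_like) in model.items():
        scores[label] = log_prior + sum(log_like[t] for t in doc.split() if t in log_like)
    return max(scores, key=scores.get)

# Toy training data (hypothetical, not taken from the airline data set).
docs = ["great flight thanks", "flight delayed again", "thanks great crew", "lost bag again"]
labels = ["positive", "negative", "positive", "negative"]
model = train_nb(docs, labels)
print(predict_nb(model, "great crew"))     # positive
print(predict_nb(model, "delayed again"))  # negative
```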

from sklearn.naive_bayes import MultinomialNB


model_nb=MultinomialNB()

model_nb.fit(x_train,y_train)

MultinomialNB()

y_nb_pred=model_nb.predict(x_test)
Model Evaluation
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, classification_report, roc_auc_score,
                             confusion_matrix)

naive_bayes_accuracy=accuracy_score(y_test,y_nb_pred)
naive_bayes_accuracy

0.755464480874317

Naive Bayes gives an accuracy of approx. 75.55%.

print(classification_report(y_test,y_nb_pred))

              precision    recall  f1-score   support

           0       0.76      0.95      0.85      1384
           1       0.67      0.34      0.45       440
           2       0.77      0.51      0.61       372

    accuracy                           0.76      2196
   macro avg       0.73      0.60      0.64      2196
weighted avg       0.75      0.76      0.73      2196

confusion_matrix(y_test,y_nb_pred)

array([[1320,   42,   22],
       [ 256,  149,   35],
       [ 151,   31,  190]])
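
The figures in the classification report can be recomputed by hand from the confusion matrix, where rows are true classes and columns are predicted classes. The sketch below does this for the Naive Bayes matrix printed above; the helper name `per_class_metrics` is hypothetical. Precision for a class divides its diagonal entry by the column sum, recall divides it by the row sum.

```python
# Naive Bayes confusion matrix from the output above (rows: true class, columns: predicted).
cm = [[1320, 42, 22],
      [256, 149, 35],
      [151, 31, 190]]

def per_class_metrics(matrix):
    """Return a (precision, recall) pair per class from a square confusion matrix."""
    n = len(matrix)
    results = []
    for k in range(n):
        true_pos = matrix[k][k]
        predicted_k = sum(matrix[row][k] for row in range(n))  # column sum
        actual_k = sum(matrix[k])                               # row sum
        results.append((true_pos / predicted_k, true_pos / actual_k))
    return results

accuracy = sum(cm[k][k] for k in range(3)) / sum(map(sum, cm))
print(round(accuracy, 4))  # 0.7555
for k, (precision, recall) in enumerate(per_class_metrics(cm)):
    print(k, round(precision, 2), round(recall, 2))
# 0 0.76 0.95
# 1 0.67 0.34
# 2 0.77 0.51
```

The low recall for class 1 (neutral) comes from the 256 neutral tweets in the second row that were predicted as negative.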

Model Creation with the Logistic Regression


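Logistic regression is a classifier that passes a linear combination of the independent variables
through the sigmoid function to produce a probability value between 0 and 1. It is binary by
nature; with multi_class='ovr' (one-vs-rest), one binary model is fitted per sentiment class and
the class with the highest probability is predicted.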
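
A minimal sketch of the sigmoid function and of one-vs-rest prediction, the strategy selected by multi_class='ovr' in the code below. The weights, biases and the name `predict_ovr` are hypothetical values chosen for illustration, not coefficients learned from the airline data.

```python
import math

def sigmoid(z):
    """Squash a real-valued score into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_ovr(features, weights_per_class, bias_per_class):
    """Score one binary model per class and return the most probable class index."""
    probs = []
    for weights, bias in zip(weights_per_class, bias_per_class):
        score = sum(w * x for w, x in zip(weights, features)) + bias
        probs.append(sigmoid(score))
    return max(range(len(probs)), key=probs.__getitem__), probs

# Hypothetical weights for three classes over two count features (illustration only).
weights = [[1.5, -2.0], [0.1, 0.1], [-2.0, 1.8]]
biases = [0.0, -0.5, 0.0]
label, probs = predict_ovr([2, 0], weights, biases)
print(label)  # 0
```

Each per-class probability answers "this class versus all others", so the three values need not sum to 1.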

# Convert the sparse count matrices to dense NumPy arrays
x_train = x_train.toarray()
x_test = x_test.toarray()

from sklearn.linear_model import LogisticRegression


lr=LogisticRegression(multi_class='ovr',n_jobs=-1)

lr.fit(x_train,y_train)

LogisticRegression(multi_class='ovr', n_jobs=-1)
y_lr_pred=lr.predict(x_test)

Model Evaluation with the logistic regression


logistic_accuracy=accuracy_score(y_test,y_lr_pred)
logistic_accuracy

0.7891621129326047

Logistic Regression gives an accuracy of approx. 78.92%.

print(classification_report(y_test,y_lr_pred))

              precision    recall  f1-score   support

           0       0.84      0.90      0.87      1384
           1       0.61      0.57      0.59       440
           2       0.77      0.64      0.70       372

    accuracy                           0.79      2196
   macro avg       0.74      0.70      0.72      2196
weighted avg       0.78      0.79      0.78      2196

confusion_matrix(y_test,y_lr_pred)

array([[1247,  101,   36],
       [ 155,  249,   36],
       [  79,   56,  237]])

Model Creation using Decision Tree Classifier


A decision tree is a flowchart-like tree structure where each internal node denotes a test on a
feature, each branch denotes an outcome of that test, and each leaf node holds the predicted
result. It is a versatile supervised machine-learning algorithm, which is used for both
classification and regression problems.
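
The flowchart structure can be sketched with a small hand-written tree over word counts. The tree below is hypothetical and not learned from the airline data; a fitted DecisionTreeClassifier chooses its own features and thresholds. The name `predict_tree` is likewise an assumption for illustration.

```python
# A hand-written toy tree (hypothetical, not learned from the airline data).
# Internal nodes test whether a word count exceeds a threshold; leaves hold a class label.
toy_tree = {
    "feature": "delayed", "threshold": 0,
    "left": {  # count of "delayed" <= 0
        "feature": "thanks", "threshold": 0,
        "left": {"label": "neutral"},
        "right": {"label": "positive"},
    },
    "right": {"label": "negative"},  # count of "delayed" > 0
}

def predict_tree(node, counts):
    """Walk from the root to a leaf, following the branch chosen by each test."""
    while "label" not in node:
        value = counts.get(node["feature"], 0)
        node = node["right"] if value > node["threshold"] else node["left"]
    return node["label"]

print(predict_tree(toy_tree, {"delayed": 1, "flight": 1}))  # negative
print(predict_tree(toy_tree, {"thanks": 2}))                # positive
print(predict_tree(toy_tree, {"flight": 1}))                # neutral
```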

from sklearn.tree import DecisionTreeClassifier

dtc = DecisionTreeClassifier()

dtc.fit(x_train,y_train)

DecisionTreeClassifier()

y_pred_dtc= dtc.predict(x_test)
Model Evaluation Using Decision Tree Classifier
decisiontree_accuracy=accuracy_score(y_test,y_pred_dtc)
decisiontree_accuracy

0.651183970856102

The Decision Tree gives an accuracy of approx. 65.12%.

print(classification_report(y_test,y_pred_dtc))

              precision    recall  f1-score   support

           0       0.80      0.74      0.77      1384
           1       0.37      0.47      0.41       440
           2       0.57      0.56      0.57       372

    accuracy                           0.65      2196
   macro avg       0.58      0.59      0.58      2196
weighted avg       0.67      0.65      0.66      2196

confusion_matrix(y_test,y_pred_dtc)

array([[1018,  277,   89],
       [ 171,  205,   64],
       [  88,   77,  207]])

Model creation by using Random Forest Classifier

Random Forest is a classifier that trains a number of decision trees on different subsets of the
given dataset and combines their predictions to improve predictive accuracy and reduce
overfitting.
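
The two ingredients described above, resampled training subsets and combined predictions, can be sketched in a few lines. The helper names and the votes are hypothetical. The sketch combines trees by simple majority vote for clarity, whereas scikit-learn's RandomForestClassifier averages the trees' predicted class probabilities.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, the resampling each tree is trained on."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(predictions):
    """Combine per-tree class predictions by taking the most common label."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)
rows = list(range(10))
sample = bootstrap_sample(rows, rng)
print(len(sample), set(sample) <= set(rows))  # 10 True

# Hypothetical votes from five trees for one tweet (illustration only).
print(majority_vote(["negative", "neutral", "negative", "positive", "negative"]))  # negative
```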

from sklearn.ensemble import RandomForestClassifier

rfc=RandomForestClassifier()

rfc.fit(x_train,y_train)

RandomForestClassifier()

y_pred_rfc=rfc.predict(x_test)
Model Evaluation of Random Forest Classifier
rf_accuracy=accuracy_score(y_test,y_pred_rfc)
rf_accuracy

0.7418032786885246

Random Forest gives an accuracy of approx. 74.18%.

print(classification_report(y_test,y_pred_rfc))

              precision    recall  f1-score   support

           0       0.82      0.86      0.84      1384
           1       0.51      0.50      0.51       440
           2       0.70      0.56      0.63       372

    accuracy                           0.74      2196
   macro avg       0.68      0.64      0.66      2196
weighted avg       0.74      0.74      0.74      2196

confusion_matrix(y_test,y_pred_rfc)

array([[1197,  143,   44],
       [ 173,  222,   45],
       [  94,   68,  210]])

Model Creation with the XGBoost


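XGBoost, short for eXtreme Gradient Boosting, is a machine learning library known for its
efficiency, speed, and accuracy. It belongs to the family of boosting algorithms, which are
ensemble learning techniques that combine the predictions of multiple weak learners trained in
sequence, each one correcting the errors of the ensemble so far. Note that the XGBRFClassifier
used below is XGBoost's random forest variant, which trains its trees in a single round rather
than boosting them sequentially; XGBClassifier is the boosted counterpart.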
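
The boosting idea, combining weak learners so that each new one corrects what the ensemble still gets wrong, can be sketched with one-split "stumps" fitted to residuals. This is plain gradient boosting for squared error on made-up one-dimensional data, with hypothetical helper names and parameters; it is not XGBoost's algorithm and omits its regularization and second-order terms.

```python
def fit_stump(xs, residuals):
    """Find the threshold split on xs that minimises squared error of the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]

def boost(xs, ys, rounds=20, learning_rate=0.5):
    """Additively fit stumps to the residuals of the current ensemble."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((threshold, left_mean, right_mean))
        preds = [p + learning_rate * (left_mean if x <= threshold else right_mean)
                 for x, p in zip(xs, preds)]
    return base, stumps, preds

# Made-up data with a jump between x=3 and x=4 (illustration only).
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
base, stumps, preds = boost(xs, ys)
baseline_mse = sum((y - base) ** 2 for y in ys) / len(ys)
mse = sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)
print(mse < baseline_mse)  # True
```

Each stump alone is a poor model, but the training error shrinks as more of them are added.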

from xgboost import XGBRFClassifier

xgb=XGBRFClassifier(n_jobs=-1)

xgb.fit(x_train,y_train)

XGBRFClassifier(base_score=None, booster=None, callbacks=None,
                colsample_bylevel=None, colsample_bytree=None, device=None,
                early_stopping_rounds=None, enable_categorical=False,
                eval_metric=None, feature_types=None, gamma=None,
                grow_policy=None, importance_type=None,
                interaction_constraints=None, max_bin=None,
                max_cat_threshold=None, max_cat_to_onehot=None,
                max_delta_step=None, max_depth=None, max_leaves=None,
                min_child_weight=None, missing=nan, monotone_constraints=None,
                multi_strategy=None, n_estimators=None, n_jobs=-1,
                num_parallel_tree=None, objective='multi:softprob',
                random_state=None, reg_alpha=None, ...)

y_xgb_pred=xgb.predict(x_test)

Model Evaluation using XGBoost


xgb_accuracy=accuracy_score(y_test,y_xgb_pred)
xgb_accuracy

0.703551912568306

XGBoost (XGBRFClassifier) gives an accuracy of approx. 70.36%.

print(classification_report(y_test,y_xgb_pred))

              precision    recall  f1-score   support

           0       0.71      0.95      0.81      1384
           1       0.61      0.10      0.18       440
           2       0.68      0.49      0.57       372

    accuracy                           0.70      2196
   macro avg       0.67      0.52      0.52      2196
weighted avg       0.69      0.70      0.64      2196

confusion_matrix(y_test,y_xgb_pred)

array([[1316,   19,   49],
       [ 358,   45,   37],
       [ 178,   10,  184]])

Model Comparison Report

# Bar chart of test accuracy for each model
plt.figure(figsize=(20,7))
sns.barplot(x=['Naive Bayes', 'Logistic Regression', 'Decision Tree Classifier',
               'Random Forest Classifier', 'XGBRFClassifier'],
            y=[naive_bayes_accuracy, logistic_accuracy, decisiontree_accuracy,
               rf_accuracy, xgb_accuracy])

<Axes: >
Conclusion
Logistic regression performed best among the tested models with about 78.9% test accuracy,
followed by Naive Bayes (75.5%) and Random Forest (74.2%); XGBRFClassifier (70.4%) and the
Decision Tree (65.1%) trail behind. For context, always predicting the majority class
(negative, 1384 of the 2196 test tweets) would already score about 63%, and every model
has lower recall on the neutral and positive classes than on the negative class.
There is still room for improvement, particularly in enhancing the
performance of the Decision Tree and XGBoost models.
