Sentiment Analysis - Comparing Algorithms Accuracy
Note:
From a machine learning point of view, raw text is useless. Only when we transform it into
meaningful numbers can we feed it into machine-learning algorithms such as clustering.
The same is true for more mundane operations on text,
such as similarity measurement.
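To make the note concrete, here is a minimal sketch of turning text into numbers and measuring similarity. It uses a plain bag-of-words count and cosine similarity; the function names and sample sentences are illustrative assumptions, not part of the notebook's pipeline.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text and count each word (a sparse bag-of-words vector)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical sample sentences
v1 = bag_of_words("I love this movie")
v2 = bag_of_words("I love this film")
v3 = bag_of_words("Traffic was terrible today")

print(cosine_similarity(v1, v2))  # three of four words shared
print(cosine_similarity(v1, v3))  # no words shared
```

Once each text is a vector of counts, the same representation can be passed to clustering or classification code.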
This project can pull data from Twitter, but to do that you need to request your own API keys
and set them as shown below (I removed mine):
my_api_key = "xxxxxxxxx"
my_api_secret = "yyyyyyy"
If you don't have API keys already, you may use the "Raw Data" file, which I pulled from Twitter using:
You can specify the number of tweets you want to pull. Here I pulled 100
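import warnings
import re
import string
import numpy as np
import pandas as pd
import nltk
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import tweepy as tw
from matplotlib.axes._axes import _log as matplotlib_axes_logger
from nltk.corpus import stopwords
%matplotlib inline
sns.set(style="white",color_codes=True)
sns.set(font_scale=1.5)
stopword = set(stopwords.words('english'))
warnings.filterwarnings('ignore')
matplotlib_axes_logger.setLevel('ERROR')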
In [ ]:
my_api_key = "xxxxxxxxxxxxxxxxxxx"
my_api_secret = "xxxxxxxxxxxxxxxxxxxxxxxx"
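# authenticate (assumed reconstruction: the original auth lines are missing from this export)
auth = tw.OAuthHandler(my_api_key, my_api_secret)
api = tw.API(auth, wait_on_rate_limit=True)
# NOTE: search_query is defined in a line that this export does not show
# tweets = tw.Cursor(api.search_tweets,q=search_query,lang="en",since="2015-09-16").item
tweets = tw.Cursor(api.search_tweets,q=search_query,lang="en").items(500)
tweets_copy = []
for tweet in tweets:
    tweets_copy.append(tweet)
data= pd.DataFrame()
for tweet in tweets_copy:
    # collect the hashtag texts of each tweet, skipping tweets without any
    hashtags = []
    try:
        for hashtag in tweet.entities["hashtags"]:
            hashtags.append(hashtag["text"])
    except:
        pass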
'user_location': tweet.user.location,
'user_description': tweet.user.description,
'user_verified': tweet.user.verified,
'date': tweet.created_at,
'text': text,
'language': tweet.lang,
'favourites-count': tweet.favorite_count,
'author': tweet.user.screen_name,
'retweet-count': tweet.retweet_count,
After importing all the libraries above, run this cell to load the data.
data.head()
Out[2]:
   Unnamed: 0  user_name           ID                   user_location     user_description                                    user_verified  date (clipped)
0  0           Horatio             1490817580352819203  Alexandria ,MN    I'm a mushroom spore floating around Central M...   False          2022-0... 22:39:43+0...
1  0           jeremy t            1490817570399666180  Moscow, Russia    NaN                                                 False          2022-0... 22:39:41+0...
2  0           Marie williams      1490817565362327552  NaN               NaN                                                 False          2022-0... 22:39:40+0...
3  0           el ch'val a coukse  1490817548153196546  NaN               Ti Tannant                                          False          2022-0... 22:39:36+0...
4  0           Honk Honk 🚛🚚🛻!!!    1490817547926708225  Toronto, Ontario  #Christian | #Gay | #Torontonian Philippians 4:13  False          2022-0... 22:39:36+0...
In [ ]:
print(data.skew())
In [5]:
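from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

def clean_text(text):
    text = str(text).lower()
    return text

def process_tweets(tweet):
    #tokenizing words
    tokens = word_tokenize(tweet)
    # remove stopwords (assumed step: final_tokens is not defined in the lines this export shows)
    final_tokens = [w for w in tokens if w not in stopword]
    # lemmatize every token longer than one character
    wordLemm = WordNetLemmatizer()
    finalwords=[]
    for w in final_tokens:
        if len(w)>1:
            word = wordLemm.lemmatize(w)
            finalwords.append(word)
    return ' '.join(finalwords)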
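The cleaned output in the next cells shows punctuation, hashtags, and digits removed (for example "el ch'val a" becomes "el chval a"), which the visible clean_text does not do on its own. The following is a minimal sketch of such a cleaning step using only the standard library. The function name clean_tweet and the exact set of rules are assumptions, not the notebook's original code.

```python
import re
import string

def clean_tweet(text):
    """Lowercase a tweet and strip URLs, @mentions, punctuation, and digits."""
    text = str(text).lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # drop links
    text = re.sub(r"@\w+", " ", text)                    # drop @mentions
    # delete every ASCII punctuation character (this also removes '#' and '|')
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\d+", " ", text)                     # drop digits
    return re.sub(r"\s+", " ", text).strip()             # collapse whitespace

# Hypothetical input modeled on the profile descriptions shown above
sample = "I'm a #Christian | Philippians 4:13 https://t.co/abc @user"
print(clean_tweet(sample))
```

Emoji are left untouched by these rules, which matches the cleaned rows that still contain them.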
In [6]:
# Now we have cleaned data for three features: user_description, text, and user_name
pd.DataFrame(data).head()
Out[6]:
   user_name          ID                   user_location     user_description                                    user_verified  date                       (clipped column)
0  horatio            1490817580352819203  Alexandria ,MN    im a mushroom spore floating around central mi...   False          2022-02-07 22:39:43+00:00  rt po... ma...
0  jeremy t           1490817570399666180  Moscow, Russia                                                        False          2022-02-07 22:39:41+00:00  rt v... worke...
0  marie williams     1490817565362327552                                                                        False          2022-02-07 22:39:40+00:00  rt k...
0  el chval a coukse  1490817548153196546                    ti tannant                                          False          2022-02-07 22:39:36+00:00  🖕... 🎶truc...
0  honk honk 🚛🚚🛻     1490817547926708225  Toronto, Ontario  christian gay torontonian philippians               False          2022-02-07 22:39:36+00:00  rt v... worke...
In [7]:
data.to_csv("Clean Data.csv")
data.head()
Out[7]:
(first five rows; the visible columns are identical to Out[6] above)
For example, words like 'love', 'enjoy', 'happy', and 'like' all convey a positive sentiment. VADER also
handles the basic context around these words, such as treating "did not love" as a negative
statement, and it accounts for the emphasis added by capitalization and punctuation, such as "ENJOY."
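The idea of a lexicon plus simple rules can be sketched in a few lines. This is a toy scorer written for illustration only: the word values, the negation list, and the 1.5 capitalization boost are assumptions and are not VADER's actual lexicon or weights.

```python
# Toy lexicon: illustrative values only, not VADER's real scores
LEXICON = {"love": 3.0, "enjoy": 2.0, "happy": 2.5, "like": 1.5,
           "hate": -3.0, "sad": -2.0}
NEGATIONS = {"not", "no", "never"}

def toy_score(sentence):
    """Sum lexicon values, flipping after a negation and boosting ALL-CAPS words."""
    score = 0.0
    negate = False
    for raw in sentence.split():
        word = raw.strip(".,!?\"'")
        lower = word.lower()
        if lower in NEGATIONS:
            negate = True          # applies to the next sentiment word
            continue
        if lower in LEXICON:
            value = LEXICON[lower]
            if word.isupper() and len(word) > 1:
                value *= 1.5       # capitalization adds emphasis
            if negate:
                value = -value     # "did not love" becomes negative
            score += value
            negate = False
    return score

print(toy_score("I love it"))
print(toy_score("I did not love it"))
print(toy_score("I ENJOY this"))
```

The real analyzer uses a much larger lexicon and more careful rules, but the mechanics are of this kind: look words up, then adjust for negation and emphasis.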
In [8]:
## Add a "Sentiment" column with positive, negative, and neutral categories
In [9]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA
sid = SIA()
In [10]:
# drop sentiments column... not needed
data.drop(columns=['Sentiments'],inplace=True)
data.head()
Out[10]:
(first five rows; the visible columns are identical to Out[6] above)
In [11]:
#Number of Words
Out[11]:
(first five rows; the visible columns are identical to Out[6] above)
In [12]:
# WordCloud using actual clean data
#plt.axis('off')
#plt.show
Sentiment Analysis
To start the analysis, we create two new columns, Polarity and Subjectivity, and compute both
values for each comment. Polarity ranges from -1 to 1 and measures how negative or positive a
comment is; in other words, it captures the emotion expressed in a sentence. Subjectivity ranges
from 0 (objective) to 1 (subjective) and measures how strongly a sentence expresses personal
feelings, views, or beliefs. A subjective sentence may not express any sentiment.
In [13]:
from textblob import TextBlob
# get subjectivity
def getSubjectivity(txt):
    return TextBlob(txt).sentiment.subjectivity
# get polarity
def getPolarity(txt):
    return TextBlob(txt).sentiment.polarity
#Columns
data['Subjectivity'] = data['text'].apply(getSubjectivity)
data['Polarity'] = data['text'].apply(getPolarity)
data.head()
Out[13]:
(first five rows; the visible columns are identical to Out[6] above)
In [14]:
# function to compute analysis
def getAnalysis(score):
    if score < 0 :
        return 'Negative'
    elif score == 0:
        return 'Neutral'
    else:
        return 'Positive'
data['Analysis'] = data['Polarity'].apply(getAnalysis)
In [15]:
data.head()
Out[15]:
(first five rows; the visible columns are identical to Out[6] above)
5 rows × 21 columns
In [16]:
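# % Percentages:
# split by label (assumed: the selection lines are missing from this export)
pcomments = data[data['Analysis'] == 'Positive']
ncomments = data[data['Analysis'] == 'Negative']
nucomments = data[data['Analysis'] == 'Neutral']
pcomments = pcomments['text']
ncomments = ncomments['text']
nucomments = nucomments['text']
Positive: 28.4%
Negative: 33.6%
Neutral: 38.0%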
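The percentage lines of this cell are not visible in the export. As a rough sketch of how such shares can be computed from a list of labels, assuming hypothetical counts (142, 168, 190 out of 500) chosen only because they reproduce the percentages printed above:

```python
from collections import Counter

def sentiment_shares(labels):
    """Return each label's share of the total, in percent, rounded to one decimal."""
    total = len(labels)
    if total == 0:
        return {}
    counts = Counter(labels)
    return {label: round(100 * n / total, 1) for label, n in counts.items()}

# Hypothetical label counts, not taken from the notebook's data
labels = ["Positive"] * 142 + ["Negative"] * 168 + ["Neutral"] * 190
shares = sentiment_shares(labels)
for label, pct in shares.items():
    print("{}: {}%".format(label, pct))
```

With a pandas Series, `value_counts(normalize=True)` gives the same proportions directly.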
In [17]:
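# the below function will create a word cloud
from wordcloud import WordCloud
def wordcloud_draw(data, color='black'):
    # NOTE: the start of this function is truncated in this export;
    # the link/mention filter below is an assumed reconstruction
    words = ' '.join(data)
    cleaned_word = ' '.join([word for word in words.split()
                             if 'http' not in word and not word.startswith('@')
                             ])
    wordcloud = WordCloud(background_color=color,
                          width=2500,
                          height=2000
                          ).generate(cleaned_word)
    plt.figure(1,figsize=(5, 7))
    plt.imshow(wordcloud)
    plt.axis('off')
    plt.show()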
In [18]:
wordcloud_draw(data.text, 'black')
In [19]:
print("Positive words are", pcomments.count())
wordcloud_draw(pcomments, 'black')
In [20]:
print("Negative words are", ncomments.count())
wordcloud_draw(ncomments)
In [21]:
print("Neutral words are", nucomments.count())
wordcloud_draw(nucomments, 'black')
In [22]:
# Value Count
data['Analysis'].value_counts()
# Plot
plt.title('Sentiment Analysis')
plt.xlabel('Sentiment')
plt.ylabel('Counts')
data['Analysis'].value_counts().plot(kind= 'bar')
plt.show()
Out[23]:
user_name               0
ID                      0
user_location           0
user_description        0
user_verified           0
date                    0
text                    0
language                0
favourites-count        0
author                  0
retweet-count           0
hashtags              112
source                  0
Positive Sentiment      0
Neutral Sentiment       0
Negative Sentiment      0
Number of Words         0
Subjectivity            0
Analysis                0
dtype: int64
In [24]:
data.shape
Out[24]:
(500, 21)
In [25]:
data.dropna(inplace=True)
data.isnull().sum()
Out[25]:
user_name             0
ID                    0
user_location         0
user_description      0
user_verified         0
date                  0
text                  0
language              0
favourites-count      0
author                0
retweet-count         0
hashtags              0
source                0
Positive Sentiment    0
Neutral Sentiment     0
Negative Sentiment    0
Number of Words       0
Subjectivity          0
Polarity              0
Analysis              0
dtype: int64
In [26]:
data.shape
Out[26]:
(388, 21)
In [27]:
data.columns
Out[27]:
'date', 'text', 'language', 'favourites-count', 'author',
dtype='object')
In [28]:
# drop irrelevant data
In [29]:
# check data types and encode object type
data.dtypes
Out[29]:
user_location        object
user_description     object
user_verified          bool
text                 object
favourites-count      int64
retweet-count         int64
source               object
Subjectivity        float64
Analysis             object
dtype: object
In [30]:
enco = LabelEncoder()
data['user_location'] = enco.fit_transform(data['user_location'])
data['user_description'] = enco.fit_transform(data['user_description'])
data['user_verified'] = enco.fit_transform(data['user_verified'])
data['text'] = enco.fit_transform(data['text'])
data['date'] = enco.fit_transform(data['date'])
data['source'] = enco.fit_transform(data['source'])
data['Analysis'] = enco.fit_transform(data['Analysis'])
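To see what the encoder does to a column such as Analysis, here is a small pure-Python sketch of the same idea: each distinct value is mapped to its position in the sorted list of distinct values. The helper name label_encode is hypothetical; scikit-learn's LabelEncoder orders classes the same way.

```python
def label_encode(values):
    """Map each distinct value to its rank among the sorted distinct values."""
    classes = sorted(set(values))
    index = {c: i for i, c in enumerate(classes)}
    return [index[v] for v in values], classes

# Hypothetical sample of the three sentiment labels
codes, classes = label_encode(["Positive", "Neutral", "Negative", "Neutral"])
print(classes)
print(codes)
```

Because the order is alphabetical, the three sentiment labels become Negative = 0, Neutral = 1, Positive = 2, which is how the rows and columns of the confusion matrices further down should be read. For free-text columns such as text, the resulting integers are arbitrary identifiers rather than meaningful magnitudes.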
In [31]:
data.head()
In [32]:
X = data.drop(["Analysis"], axis=1)
y= data.Analysis
In [33]:
from sklearn.decomposition import PCA
pca = PCA(n_components=3)
fit = pca.fit(X)
fit.explained_variance_ratio_
print(fit.components_)
3.97336839e-06]
4.75840743e-05]
-3.09667478e-05]]
In [34]:
x_train,x_test,y_train,y_test=train_test_split(X,y,test_size=0.10,random_state=1)
In [35]:
#Feature scaling / standardization (optional, but it can improve accuracy for scale-sensitive models)
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
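The reason for calling fit_transform on the training set but only transform on the test set is that the mean and standard deviation must come from the training data alone. A minimal single-column sketch of that logic, with hypothetical helper names and toy numbers:

```python
import statistics

def fit_scaler(column):
    """Learn the mean and population standard deviation of one feature."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    return mean, (std if std > 0 else 1.0)  # avoid dividing by zero

def transform(column, mean, std):
    """Standardize values with previously learned statistics."""
    return [(x - mean) / std for x in column]

# Toy feature values (assumed for illustration)
train = [2.0, 4.0, 6.0, 8.0]
test = [5.0, 10.0]

mean, std = fit_scaler(train)              # statistics come from train only
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)   # reuse the train statistics
print(train_scaled)
print(test_scaled)
```

Fitting the scaler on the test set as well would leak information about the test data into the model's inputs.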
In [36]:
print (x_train.shape, y_train.shape)
print (x_test.shape, y_test.shape)
(349, 9) (349,)
(39, 9) (39,)
In [64]:
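#Gaussian Naive Bayes model
from sklearn.naive_bayes import GaussianNB
# fit and predict (assumed reconstruction: these lines are missing from this export)
classifier = GaussianNB()
classifier.fit(x_train, y_train)
y_pred = classifier.predict(x_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)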
[[15 0 0]
[ 0 15 0]
[ 9 0 0]]
   predictions  actual
0            0       2
0            0       0
0            1       1
0            0       0
0            0       0
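The accuracy value can be read straight off a confusion matrix: correct predictions sit on the diagonal, so accuracy is the diagonal sum divided by the total. A short sketch using the Gaussian Naive Bayes matrix printed above (the function name is an assumption):

```python
def accuracy_from_confusion(cm):
    """Accuracy = correctly classified samples (diagonal) / all samples."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

# Confusion matrix copied from the Gaussian Naive Bayes output above
cm_nb = [[15, 0, 0],
         [0, 15, 0],
         [9, 0, 0]]
print("Accuracy: {:.2f}%".format(100 * accuracy_from_confusion(cm_nb)))
```

Here 30 of the 39 test samples lie on the diagonal; all 9 samples of the third class were assigned to the first class.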
In [65]:
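#DecisionTree Classifier
from sklearn.tree import DecisionTreeClassifier
# default settings assumed: the original constructor line is missing from this export
classifier = DecisionTreeClassifier()
classifier.fit(x_train, y_train)
y_predict = classifier.predict(x_test)
cm = confusion_matrix(y_test, y_predict)
print(cm)
accuracy_score(y_test, y_predict)
accuracy = classifier.score(x_test, y_test)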
[[14 0 1]
[ 0 15 0]
[ 0 0 9]]
   predictions  actual
0            2       2
0            0       0
0            1       1
0            0       0
0            0       0
In [66]:
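#K-Nearest Neighbors (K-NN)
from sklearn.neighbors import KNeighborsClassifier
# default settings assumed: the original constructor line is missing from this export
classifier = KNeighborsClassifier()
classifier.fit(x_train, y_train)
# predict with this model (otherwise y_predict from the previous cell is reused)
y_predict = classifier.predict(x_test)
cm = confusion_matrix(y_test, y_predict)
print(cm)
accuracy_score(y_test, y_predict)
accuracy = classifier.score(x_test, y_test)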
[[14 0 1]
[ 0 15 0]
[ 0 0 9]]
   predictions  actual
0            2       2
0            0       0
0            1       1
0            0       0
0            0       0
In [67]:
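#Kernel SVM
from sklearn.svm import SVC
# RBF kernel assumed: the original constructor line is missing from this export
classifier = SVC(kernel='rbf')
classifier.fit(x_train, y_train)
# predict with this model (otherwise y_predict from the previous cell is reused)
y_predict = classifier.predict(x_test)
cm = confusion_matrix(y_test, y_predict)
print(cm)
accuracy_score(y_test, y_predict)
accuracy = classifier.score(x_test, y_test)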
[[14 0 1]
[ 0 15 0]
[ 0 0 9]]
   predictions  actual
0            2       2
0            0       0
0            1       1
0            0       0
0            0       0
In [68]:
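#Logistic Regression
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(x_train, y_train)
# predict with this model (otherwise y_predict from the previous cell is reused)
y_predict = classifier.predict(x_test)
cm = confusion_matrix(y_test, y_predict)
print(cm)
accuracy_score(y_test, y_predict)
accuracy = classifier.score(x_test, y_test)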
[[14 0 1]
[ 0 15 0]
[ 0 0 9]]
In [69]:
#Random Tree Regression
cm = confusion_matrix(y_test, y_predict)
print(cm)
accuracy_score(y_test, y_predict)
accuracy = classifier.score(x_test, y_test)
[[14 0 1]
[ 0 15 0]
[ 0 0 9]]
   predictions  actual
0            2       2
0            0       0
0            1       1
0            0       0
0            0       0
In [70]:
print(Gaussian_Naive_Bayes)
print(DecisionTree)
print(Kernel_SVM)
print(Logistic_Regression)
print(Random_Tree_Regression)
In [71]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
# fit and project in one step
x_pca = pca.fit_transform(X)
In [72]:
plt.figure(figsize=(9,6))
plt.scatter(x_pca[:,0],x_pca[:,1],c=y,cmap='viridis')
In [45]:
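from sklearn.metrics import confusion_matrix, accuracy_score
# NOTE: several lines of this function are missing from this export; the gaps are
# filled in following the widely used scikit-learn example (an assumed reconstruction)
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    Print and plot the confusion matrix.
    Set normalize=True to show row-wise proportions instead of counts.
    """
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')
    print(cm)
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    # write each cell's value, in white on dark cells and black on light ones
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')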
In [57]:
import itertools
plt.figure(figsize=(7,5))
print()
print("Accuracy: {:.2f}%".format(accuracy*100))
[[16 0 2]
[ 0 0 17]
[ 3 0 1]]
Accuracy: 89.74%
In [47]:
plt.figure(figsize=(20,7))
Out[47]:
<AxesSubplot:>
In [48]:
print(classification_report(y_test,y_pred))
accuracy 0.44 39
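Most of the classification report is cut off in this export. Its per-class precision and recall columns can be derived from a confusion matrix, as the sketch below shows. It uses the Gaussian Naive Bayes matrix from earlier as input; the function name is hypothetical, and rows are taken as true labels and columns as predicted labels, which is scikit-learn's convention.

```python
def per_class_metrics(cm):
    """Return a (precision, recall) pair for every class in the matrix."""
    n = len(cm)
    metrics = []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum
        actual_k = sum(cm[k])                          # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        metrics.append((precision, recall))
    return metrics

# Confusion matrix copied from the Gaussian Naive Bayes output
cm_nb = [[15, 0, 0],
         [0, 15, 0],
         [9, 0, 0]]
for k, (p, r) in enumerate(per_class_metrics(cm_nb)):
    print("class {}: precision={:.3f} recall={:.3f}".format(k, p, r))
```

A class that is never predicted, like the third one here, gets precision and recall of zero even when overall accuracy looks acceptable, which is why the report is worth reading alongside the accuracy figure.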
In [49]:
data.columns
Out[49]:
'favourites-count', 'retweet-count', 'source', 'Subjectivity',
'Analysis'],
dtype='object')
In [50]:
from sklearn.metrics import mean_squared_error,r2_score
In [51]:
classifier = LogisticRegression(random_state = 0)
classifier.fit(x_train, y_train)
coeff = list(classifier.coef_[0])
'Analysis']])
features = pd.DataFrame()
features['Features'] = labels
features['importance'] = coeff
features.set_index('Features', inplace=True)
Out[51]:
Text(0.5, 0, 'Importance')
In [56]:
import statsmodels.formula.api as smf
(regression summary table; only the fragments below are recoverable from this export)
Df Model: 4
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [ ]: