0% found this document useful (0 votes)
57 views

Sentiment Analysis - Comparing Algorithms Accuracy

The document discusses comparing the accuracy of different algorithms for sentiment analysis of tweets. It introduces importing necessary libraries in Python and extracting tweet data from Twitter using APIs. The data is then loaded and cleaned before analyzing sentiment using various algorithms and evaluating their performance.
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
57 views

Sentiment Analysis - Comparing Algorithms Accuracy

The document discusses comparing the accuracy of different algorithms for sentiment analysis of tweets. It introduces importing necessary libraries in Python and extracting tweet data from Twitter using APIs. The data is then loaded and cleaned before analyzing sentiment using various algorithms and evaluating their performance.
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 22

2/7/22, 5:06 PM Sentiment Analysis - Comparing Algorithms Accuracy

Author: Adegbenro Michael Olusola:

Note:
From the machine learning point of view, raw text is useless. Only if we manage to transform
it into meaningful numbers, can we feed it into our machine-learning algorithms such as clustering.
The same is true for more mundane operations on text,
such as similarity measurement

This project can pull data from Tweeter but to do that you need to request for your own API keys
specified below (I removed mine):

my_api_key = "xxxxxxxxx"
my_api_secret = "yyyyyyy"

If you don't have API keys already, you may use "Raw Data" which i pulled from tweeter using:

You can specifiy amount of tweets you want to pull. Here I pulled 100

Import Necessary Libraries


In [4]:
import pandas as pd

import numpy as np

import re

import string

from nltk.corpus import stopwords

from wordcloud import WordCloud,STOPWORDS

from nltk.stem.porter import PorterStemmer

import nltk

from nltk.corpus import stopwords

import matplotlib

import matplotlib.pyplot as plt

from pandas.plotting import scatter_matrix

%matplotlib inline

from nltk.stem import WordNetLemmatizer

import seaborn as sns

sns.set(style="white",color_codes=True)

from sklearn.metrics import classification_report

from sklearn.model_selection import train_test_split

from sklearn.preprocessing import LabelEncoder,StandardScaler

sns.set(font_scale=1.5)

from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA

from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import train_test_split

from textblob import TextBlob

from sklearn.metrics import confusion_matrix

from sklearn.metrics import classification_report

from sklearn.metrics import accuracy_score

from sklearn.metrics import precision_score

from sklearn.metrics import recall_score

from sklearn.metrics import f1_score

from sklearn.linear_model import LinearRegression,Ridge,Lasso,ElasticNet

from sklearn.tree import DecisionTreeRegressor

from sklearn import metrics

from nltk import word_tokenize

from wordcloud import WordCloud,STOPWORDS

stopword = set(stopwords.words('english'))

import tweepy as tw

localhost:8888/nbconvert/html/Documents/IT Courses/Machine Learning/Refrence folders/Natural Langage Processing/Sentiment Analysis/ Sentime… 1/22


2/7/22, 5:06 PM Sentiment Analysis - Comparing Algorithms Accuracy
import warnings

warnings.filterwarnings('ignore')

from matplotlib.axes._axes import _log as matplotlib_axes_logger

matplotlib_axes_logger.setLevel('ERROR')

Extract Data fromm Twitter


remove this cell if you don't have API key

In [ ]:
my_api_key = "xxxxxxxxxxxxxxxxxxx"

my_api_secret = "xxxxxxxxxxxxxxxxxxxxxxxx"

# authenticate

auth = tw.OAuthHandler(my_api_key, my_api_secret)

api = tw.API(auth, wait_on_rate_limit=True)

search_query = "#Ottawa -filter:retweet"

# tweets = tw.Cursor(api.search_tweets,q=search_query,lang="en",since="2015-09-16").item

tweets = tw.Cursor(api.search_tweets,q=search_query,lang="en").items(500)

# store the API responses in a list

tweets_copy = []

for tweet in tweets:

tweets_copy.append(tweet)

print("Total Tweets fetched:", len(tweets_copy))

# intialize the dataframe

data= pd.DataFrame()

# populate the dataframe

for tweet in tweets_copy:

hashtags = []

try:

for hashtag in tweet.entities["hashtags"]:

hashtags.append(hashtag["text"])

text = api.get_status(id=tweet.id, tweet_mode='extended').full_text

except:

pass

data = data.append(pd.DataFrame({'user_name': tweet.user.name,'ID': tweet.id_str,

'user_location': tweet.user.location,

'user_description': tweet.user.description,

'user_verified': tweet.user.verified,

'date': tweet.created_at,

'text': text,

'language': tweet.lang,

'favourites-count': tweet.favorite_count,

'author': tweet.user.screen_name,

'retweet-count': tweet.retweet_count,

'hashtags': [hashtags if hashtags else None],


'source': tweet.source}))

Total Tweets fetched: 500

Affter importing all libraries above, run this cell to load data.

localhost:8888/nbconvert/html/Documents/IT Courses/Machine Learning/Refrence folders/Natural Langage Processing/Sentiment Analysis/ Sentime… 2/22


2/7/22, 5:06 PM Sentiment Analysis - Comparing Algorithms Accuracy

In [2]: #load data from local drive

data = pd.read_csv("Raw Data.csv")

data.head()

Out[2]: Unnamed:
user_name ID user_location user_description user_verified
0

I'm a mushroom
Alexandria spore floating 2022-0
0 0 Horatio 1490817580352819203 False
,MN around Central 22:39:43+0
M...

Moscow, 2022-0
1 0 jeremy t 1490817570399666180 NaN False
Russia 22:39:41+0

Marie 2022-0
2 0 1490817565362327552 NaN NaN False
williams 22:39:40+0

el ch'val a 2022-0
3 0 1490817548153196546 NaN Ti Tannant False
coukse 22:39:36+0

#Christian | #Gay
Honk Honk Toronto, 2022-0
4 0 1490817547926708225 | #Torontonian False
🚛🚚🛻!!! Ontario 22:39:36+0
Philippians 4:13

In [ ]:
print(data.skew())

#If the value is closer to zero, then it shows less skew.

In [5]:
def clean_text(text):

'''Make text lowercase, remove text in square brackets,remove links,remove punctuat


and remove words containing numbers.'''

text = str(text).lower()

text = re.sub('\[.*?\]', '', text)

text = re.sub('https?://\S+|www\.\S+', '', text)

text = re.sub('<.*?>+', '', text)

text = re.sub('[%s]' % re.escape(string.punctuation), '', text)

text = re.sub('\n', '', text)

text = re.sub('\w*\d\w*', '', text)

return text

def process_tweets(tweet):

#tokenizing words

tokens = word_tokenize(tweet)

#Removing Stop Words

final_tokens = [w for w in tokens if w not in stopword]

#reducing a word to its word stem

localhost:8888/nbconvert/html/Documents/IT Courses/Machine Learning/Refrence folders/Natural Langage Processing/Sentiment Analysis/ Sentime… 3/22


2/7/22, 5:06 PM Sentiment Analysis - Comparing Algorithms Accuracy
wordLemm = WordNetLemmatizer()

finalwords=[]

for w in final_tokens:

if len(w)>1:

word = wordLemm.lemmatize(w)
finalwords.append(word)

return ' '.join(finalwords)

#Apply to relevant columns

data["user_description"] = data["user_description"].apply(lambda x:clean_text(x))

data["user_name"] = data["user_name"].apply(lambda x:clean_text(x))

data["text"] = data["text"].apply(lambda x:clean_text(x))

data['text'] = data["text"].apply(lambda x: process_tweets(x))

In [6]:
# Now we have cleaned data for three features: user_description, text, and user_name

# Although, we don't need more than text to perform our analysis

pd.DataFrame(data).head()

Out[6]:
user_name ID user_location user_description user_verified date

im a mushroom
rt po
Alexandria spore floating 2022-02-07
0 horatio 1490817580352819203 False ma
,MN around central 22:39:43+00:00
mi...

rt v
Moscow, 2022-02-07
0 jeremy t 1490817570399666180 False worke
Russia 22:39:41+00:00

rt k
marie 2022-02-07
0 1490817565362327552 False
williams 22:39:40+00:00

🖕
el chval a 2022-02-07
0 1490817548153196546 ti tannant False
coukse 22:39:36+00:00 🎶truc

christian gay rt v
honk honk Toronto, 2022-02-07
0 1490817547926708225 torontonian False worke
🚛🚚🛻 Ontario 22:39:36+00:00
philippians

In [7]:
data.to_csv("Clean Data.csv")

data.head()

Out[7]:
user_name ID user_location user_description user_verified date

im a mushroom
rt po
Alexandria spore floating 2022-02-07
0 horatio 1490817580352819203 False ma
,MN around central 22:39:43+00:00
mi...

localhost:8888/nbconvert/html/Documents/IT Courses/Machine Learning/Refrence folders/Natural Langage Processing/Sentiment Analysis/ Sentime… 4/22


2/7/22, 5:06 PM Sentiment Analysis - Comparing Algorithms Accuracy

user_name ID user_location user_description user_verified date

rt v
Moscow, 2022-02-07
0 jeremy t 1490817570399666180 False worke
Russia 22:39:41+00:00

rt k
marie 2022-02-07
0 1490817565362327552 False
williams 22:39:40+00:00

🖕
el chval a 2022-02-07
0 1490817548153196546 ti tannant False
coukse 22:39:36+00:00 🎶truc

christian gay rt v
honk honk Toronto, 2022-02-07
0 1490817547926708225 torontonian False worke
🚛🚚🛻 Ontario 22:39:36+00:00
philippians

Vader Sentiment Analysis


VADER sentimental analysis relies on a dictionary that maps lexical features to emotion intensities
known as sentiment scores. The sentiment score of a text can be obtained by summing up each
word's intensity in the text.

For example,- Words like 'love,' 'enjoy,' 'happy,' 'like' all convey a positive sentiment. Also, VADER is
intelligent enough to understand these words' basic context, such as "did not love" as a negative
statement. It also understands the emphasis of capitalization and punctuation, such as "ENJOY."

In [8]:
## Added "Sentiment" column and categorized in positive, negative and neutral

In [9]:
sid = SIA()

data['Sentiments'] = data['text'].apply(lambda x: sid.polarity_scores(' '.joi


data['Positive Sentiment'] = data['Sentiments'].apply(lambda x: x['pos']+1*(10**-6))

data['Neutral Sentiment'] = data['Sentiments'].apply(lambda x: x['neu']+1*(10**-6))

data['Negative Sentiment'] = data['Sentiments'].apply(lambda x: x['neg']+1*(10**-6))

In [10]:
# drop sentiments column... not needed

data.drop(columns=['Sentiments'],inplace=True)

data.head()

Out[10]:
user_name ID user_location user_description user_verified date

im a mushroom
rt po
Alexandria spore floating 2022-02-07
0 horatio 1490817580352819203 False ma
,MN around central 22:39:43+00:00
mi...

localhost:8888/nbconvert/html/Documents/IT Courses/Machine Learning/Refrence folders/Natural Langage Processing/Sentiment Analysis/ Sentime… 5/22


2/7/22, 5:06 PM Sentiment Analysis - Comparing Algorithms Accuracy

user_name ID user_location user_description user_verified date

rt v
Moscow, 2022-02-07
0 jeremy t 1490817570399666180 False worke
Russia 22:39:41+00:00

rt k
marie 2022-02-07
0 1490817565362327552 False
williams 22:39:40+00:00

🖕
el chval a 2022-02-07
0 1490817548153196546 ti tannant False
coukse 22:39:36+00:00 🎶truc

christian gay rt v
honk honk Toronto, 2022-02-07
0 1490817547926708225 torontonian False worke
🚛🚚🛻 Ontario 22:39:36+00:00
philippians

In [11]:
#Number of Words

data['Number of Words'] =data.text.apply(lambda x:len(x.split(' ')))

#Average Word Length

data['Mean Word Length'] = data.text.apply(lambda x:np.round(np.mean([len(w) for w in x


data.head()

Out[11]:
user_name ID user_location user_description user_verified date

im a mushroom
rt po
Alexandria spore floating 2022-02-07
0 horatio 1490817580352819203 False ma
,MN around central 22:39:43+00:00
mi...

rt v
Moscow, 2022-02-07
0 jeremy t 1490817570399666180 False worke
Russia 22:39:41+00:00

rt k
marie 2022-02-07
0 1490817565362327552 False
williams 22:39:40+00:00

🖕
el chval a 2022-02-07
0 1490817548153196546 ti tannant False
coukse 22:39:36+00:00 🎶truc

christian gay rt v
honk honk Toronto, 2022-02-07
0 1490817547926708225 torontonian False worke
🚛🚚🛻 Ontario 22:39:36+00:00
philippians

In [12]:
# WordCloud using atual clean data

#allWords = ' '.join( [cmts for cmts in data.text])

#wordCloud = WordCloud(width = 500, height = 300, random_state = 21, max_font_size = 11


localhost:8888/nbconvert/html/Documents/IT Courses/Machine Learning/Refrence folders/Natural Langage Processing/Sentiment Analysis/ Sentime… 6/22


2/7/22, 5:06 PM Sentiment Analysis - Comparing Algorithms Accuracy

#plt.imshow(wordCloud, interpolation= 'bilinear')

#plt.axis('off')

#plt.show

Sentimental Analysis

Polarity and Subjectivity

In starting with the analysis we will create the new columns namely Polarity and Subjectivity and
acquire the very values of each comment. Polarity ranges from -1 to 1 and measures how positive or
negative a comment is. It simply means emotions expressed in a sentence. Subjectivity expresses
some personal feelings, views, or beliefs. A subjective sentence may not express any sentiment.

In [13]:
# get subjectivity

def getSubjectivity(txt):

return TextBlob(txt).sentiment.subjectivity

# get polarity

def getPolarity(txt):

return TextBlob(txt).sentiment.polarity

#Columns

data['Subjectivity'] = data['text'].apply(getSubjectivity)

data['Polarity'] = data['text'].apply(getPolarity)

data.head()

Out[13]:
user_name ID user_location user_description user_verified date

im a mushroom
rt po
Alexandria spore floating 2022-02-07
0 horatio 1490817580352819203 False ma
,MN around central 22:39:43+00:00
mi...

rt v
Moscow, 2022-02-07
0 jeremy t 1490817570399666180 False worke
Russia 22:39:41+00:00

rt k
marie 2022-02-07
0 1490817565362327552 False
williams 22:39:40+00:00

🖕
el chval a 2022-02-07
0 1490817548153196546 ti tannant False
coukse 22:39:36+00:00 🎶truc

christian gay rt v
honk honk Toronto, 2022-02-07
0 1490817547926708225 torontonian False worke
🚛🚚🛻 Ontario 22:39:36+00:00
philippians

In [14]:
# function to compute analysis

def getAnalysis(score):

localhost:8888/nbconvert/html/Documents/IT Courses/Machine Learning/Refrence folders/Natural Langage Processing/Sentiment Analysis/ Sentime… 7/22


2/7/22, 5:06 PM Sentiment Analysis - Comparing Algorithms Accuracy

if score < 0 :

return 'Negative'

elif score == 0:

return 'Neutral'

else:

return
'Positive'

data['Analysis'] = data['Polarity'].apply(getAnalysis)

In [15]:
data.head()

Out[15]:
user_name ID user_location user_description user_verified date

im a mushroom
rt po
Alexandria spore floating 2022-02-07
0 horatio 1490817580352819203 False ma
,MN around central 22:39:43+00:00
mi...

rt v
Moscow, 2022-02-07
0 jeremy t 1490817570399666180 False worke
Russia 22:39:41+00:00

rt k
marie 2022-02-07
0 1490817565362327552 False
williams 22:39:40+00:00

🖕
el chval a 2022-02-07
0 1490817548153196546 ti tannant False
coukse 22:39:36+00:00 🎶truc

christian gay rt v
honk honk Toronto, 2022-02-07
0 1490817547926708225 torontonian False worke
🚛🚚🛻 Ontario 22:39:36+00:00
philippians

5 rows × 21 columns

In [16]:
# % Percentages:

pcomments = data[data.Analysis == 'Positive']

pcomments = pcomments['text']

print('Positive: ' +str(round((pcomments.shape[0]/data.shape[0])*100, 1))+ '%')

ncomments = data[data.Analysis == 'Negative']

ncomments = ncomments['text']

print('Negative: ' +str(round((ncomments.shape[0]/data.shape[0])*100, 1))+ '%')

nucomments = data[data.Analysis == 'Neutral']

nucomments = nucomments['text']

print('Nuetral: ' +str(round((nucomments.shape[0]/data.shape[0])*100, 1))+ '%')

Positive: 28.4%

localhost:8888/nbconvert/html/Documents/IT Courses/Machine Learning/Refrence folders/Natural Langage Processing/Sentiment Analysis/ Sentime… 8/22


2/7/22, 5:06 PM Sentiment Analysis - Comparing Algorithms Accuracy

Negative: 33.6%

Nuetral: 38.0%

In [17]:
# the below function will create a word cloud

def wordcloud_draw(data, color = 'black'):

words = ' '.join(data)

cleaned_word = " ".join([word for word in words.split()

if 'http' not in word # double check for nay links

and not word.startswith('#') # removing hash tags

and word != 'rt'

])

wordcloud = WordCloud(stopwords=STOPWORDS, # using stopwords provided by Word cloud


background_color=color,

width=2500,

height=2000

).generate(cleaned_word)

# using matplotlib to display the images in notebook itself.

plt.figure(1,figsize=(5, 7))

plt.imshow(wordcloud)

plt.axis('off')

plt.show()

In [18]:
wordcloud_draw(data.text, 'black')

In [19]:
print("Positive words are", pcomments.count())

wordcloud_draw(pcomments, 'black')

Positive words are 142

localhost:8888/nbconvert/html/Documents/IT Courses/Machine Learning/Refrence folders/Natural Langage Processing/Sentiment Analysis/ Sentime… 9/22


2/7/22, 5:06 PM Sentiment Analysis - Comparing Algorithms Accuracy

In [20]:
print("Negative words are", ncomments.count())

wordcloud_draw(ncomments)

Negative words are 168

In [21]:
print("Neutral words are", nucomments.count())

wordcloud_draw(nucomments, 'black')

Neutral words are 190

In [22]:
# Value Count

data['Analysis'].value_counts

localhost:8888/nbconvert/html/Documents/IT Courses/Machine Learning/Refrence folders/Natural Langage Processing/Sentiment Analysis/ Sentim… 10/22


2/7/22, 5:06 PM Sentiment Analysis - Comparing Algorithms Accuracy

# Plot

plt.title('Sentiment Analysis')

plt.xlabel('Sentiment')

plt.ylabel('Counts')

data['Analysis'].value_counts().plot(kind= 'bar')

plt.show()

More on sentiment analysis: https://www.projectpro.io/article/sentiment-analysis-project-ideas-


with-source-code/518

Check Analysis Accuracy


In [23]:
data.isnull().sum()

user_name 0

Out[23]:
ID 0

user_location 0

user_description 0

user_verified 0

date 0

text 0

language 0

favourites-count 0

author 0

retweet-count 0

hashtags 112

source 0

Positive Sentiment 0

Neutral Sentiment 0

Negative Sentiment 0

Number of Words 0

Mean Word Length 0

Subjectivity 0

localhost:8888/nbconvert/html/Documents/IT Courses/Machine Learning/Refrence folders/Natural Langage Processing/Sentiment Analysis/ Sentim… 11/22


2/7/22, 5:06 PM Sentiment Analysis - Comparing Algorithms Accuracy
Polarity 0

Analysis 0

dtype: int64

In [24]:
data.shape

(500, 21)
Out[24]:

In [25]:
data.dropna(inplace=True)

data.isnull().sum()

user_name 0

Out[25]:
ID 0

user_location 0

user_description 0

user_verified 0

date 0

text 0

language 0

favourites-count 0

author 0

retweet-count 0

hashtags 0

source 0

Positive Sentiment 0

Neutral Sentiment 0

Negative Sentiment 0

Number of Words 0

Mean Word Length 0

Subjectivity 0

Polarity 0

Analysis 0

dtype: int64

In [26]:
data.shape

(388, 21)
Out[26]:

In [27]:
data.columns

Index(['user_name', 'ID', 'user_location', 'user_description', 'user_verified',

Out[27]:
'date', 'text', 'language', 'favourites-count', 'author',

'retweet-count', 'hashtags', 'source', 'Positive Sentiment',

'Neutral Sentiment', 'Negative Sentiment', 'Number of Words',

'Mean Word Length', 'Subjectivity', 'Polarity', 'Analysis'],

dtype='object')

In [28]:
# drop irrelevant data

data = data.drop(['user_name', 'ID','language', 'author','Positive Sentiment',

'Neutral Sentiment', 'Negative Sentiment', 'Number of Words','Polarity',

'Mean Word Length','hashtags'], axis=1)

In [29]:
# check data types and encode object type

data.dtypes

localhost:8888/nbconvert/html/Documents/IT Courses/Machine Learning/Refrence folders/Natural Langage Processing/Sentiment Analysis/ Sentim… 12/22


2/7/22, 5:06 PM Sentiment Analysis - Comparing Algorithms Accuracy

user_location object

Out[29]:
user_description object

user_verified bool

date datetime64[ns, UTC]

text object

favourites-count int64

retweet-count int64

source object

Subjectivity float64

Analysis object

dtype: object

In [30]:
enco = LabelEncoder()

data['user_location'] = enco.fit_transform(data['user_location'])

data['user_description'] = enco.fit_transform(data['user_description'])

data['user_verified'] = enco.fit_transform(data['user_verified'])

data['text'] = enco.fit_transform(data['text'])

data['date'] = enco.fit_transform(data['date'])

data['source'] = enco.fit_transform(data['source'])

data['Analysis'] = enco.fit_transform(data['Analysis'])

In [31]:
data.head()

Out[31]: favourites- retweet-


user_location user_description user_verified date text source Subjectivity
count count

0 7 127 0 318 94 0 1889 2 0.000000

0 0 0 0 317 78 0 790 1 0.357143

0 0 241 0 316 117 0 0 2 0.000000

0 94 176 0 315 73 0 21 4 0.000000

0 132 91 0 314 61 0 3492 1 0.400000

In [32]:
X = data.drop(["Analysis"], axis=1)

y= data.Analysis

In [33]:
from sklearn.decomposition import PCA

pca = PCA(n_components=3)

fit = pca.fit(X)

fit.explained_variance_ratio_

print(fit.components_)

[[ 3.74714044e-05 2.79055233e-03 -2.44727802e-06 -5.05181496e-03

-9.07495858e-04 -2.10093852e-04 9.99982911e-01 3.53673754e-05

3.97336839e-06]

[ 1.55773098e-01 9.59450987e-01 5.36402446e-05 2.34913945e-01

-7.54200772e-04 -8.93911860e-04 -1.49736983e-03 -6.07769299e-04

4.75840743e-05]

[ 3.68001614e-02 2.31975421e-01 1.55584824e-05 -9.71914540e-01

localhost:8888/nbconvert/html/Documents/IT Courses/Machine Learning/Refrence folders/Natural Langage Processing/Sentiment Analysis/ Sentim… 13/22


2/7/22, 5:06 PM Sentiment Analysis - Comparing Algorithms Accuracy
-1.31315953e-02 3.33579373e-03 -5.56999131e-03 8.29295615e-04

-3.09667478e-05]]

In [34]:
x_train,x_test,y_train,y_test=train_test_split(X,y,test_size=0.10,random_state=1)

In [35]:
#Feature Scaling/Standardize (not important step but it boost accuracy)

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

x_train = sc.fit_transform(x_train)

x_test = sc.transform(x_test)

In [36]:
print (x_train.shape, y_train.shape)

print (x_test.shape, y_test.shape)

(349, 9) (349,)

(39, 9) (39,)

In [64]:
#Gaussian Naive Bayes model

from sklearn.naive_bayes import GaussianNB # import library

classifier = GaussianNB() # initilaise

classifier.fit(x_train,y_train) # fit train dataset

y_pred = classifier.predict(x_test) # predict

# check for accuracy

from sklearn.metrics import confusion_matrix, accuracy_score

cm = confusion_matrix(y_test, y_pred)

print(cm)

accuracy_score(y_test, y_pred)

accuracy = classifier.score(x_test, y_test)

Gaussian_Naive_Bayes = ("Gaussian_Naive_Bayes Accuracy: {:.2f}%".format(accuracy*100))

# put prediction and actual values side-by-side

pd.DataFrame(data={'predictions': y_pred, 'actual': y_test}).head()

[[15 0 0]

[ 0 15 0]

[ 9 0 0]]

Out[64]: predictions actual

0 0 2

0 0 0

0 1 1

0 0 0

0 0 0

In [65]:
#DecisionTree Classifier

from sklearn.tree import DecisionTreeClassifier

localhost:8888/nbconvert/html/Documents/IT Courses/Machine Learning/Refrence folders/Natural Langage Processing/Sentiment Analysis/ Sentim… 14/22


2/7/22, 5:06 PM Sentiment Analysis - Comparing Algorithms Accuracy
classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)

classifier.fit(x_train, y_train)

y_predict = classifier.predict(x_test)

# check for accuracy

from sklearn.metrics import confusion_matrix, accuracy_score

cm = confusion_matrix(y_test, y_predict)
print(cm)

accuracy_score(y_test, y_predict)
accuracy = classifier.score(x_test, y_test)

DecisionTree=("DecisionTree Accuracy: {:.2f}%".format(accuracy*100))

# put prediction and actual values side-by-side

pd.DataFrame(data={'predictions': y_predict, 'actual': y_test}).head()

[[14 0 1]

[ 0 15 0]

[ 0 0 9]]

Out[65]: predictions actual

0 2 2

0 0 0

0 1 1

0 0 0

0 0 0

In [66]:
#K-Nearest Neighbors (K-NN)

from sklearn.neighbors import KNeighborsClassifier

classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)

classifier.fit(x_train, y_train)

# check for accuracy

from sklearn.metrics import confusion_matrix, accuracy_score

cm = confusion_matrix(y_test, y_predict)
print(cm)

accuracy_score(y_test, y_predict)
accuracy = classifier.score(x_test, y_test)

K_Nearest_Neighbor=("K_Nearest_Neighbor Accuracy: {:.2f}%".format(accuracy*100))

# put prediction and actual values side-by-side

pd.DataFrame(data={'predictions': y_predict, 'actual': y_test}).head()

[[14 0 1]

[ 0 15 0]

[ 0 0 9]]

Out[66]: predictions actual

0 2 2

0 0 0

0 1 1

localhost:8888/nbconvert/html/Documents/IT Courses/Machine Learning/Refrence folders/Natural Langage Processing/Sentiment Analysis/ Sentim… 15/22


2/7/22, 5:06 PM Sentiment Analysis - Comparing Algorithms Accuracy

predictions actual

0 0 0

0 0 0

In [67]:
#Kernel SVM

from sklearn.svm import SVC

classifier = SVC(kernel = 'rbf', random_state = 0)

classifier.fit(x_train, y_train)

# check for accuracy

from sklearn.metrics import confusion_matrix, accuracy_score

cm = confusion_matrix(y_test, y_predict)
print(cm)

accuracy_score(y_test, y_predict)
accuracy = classifier.score(x_test, y_test)

Kernel_SVM=("Kernel_SVM Accuracy: {:.2f}%".format(accuracy*100))

# put prediction and actual values side-by-side

pd.DataFrame(data={'predictions': y_predict, 'actual': y_test}).head()

[[14 0 1]

[ 0 15 0]

[ 0 0 9]]

Out[67]: predictions actual

0 2 2

0 0 0

0 1 1

0 0 0

0 0 0

In [68]:
#Logistic Regression

from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression(random_state = 0)

classifier.fit(x_train, y_train)

# check for accuracy

from sklearn.metrics import confusion_matrix, accuracy_score

cm = confusion_matrix(y_test, y_predict)
print(cm)

accuracy_score(y_test, y_predict)
accuracy = classifier.score(x_test, y_test)

Logistic_Regression=("Logistic_Regression Accuracy: {:.2f}%".format(accuracy*100))

[[14 0 1]

[ 0 15 0]

[ 0 0 9]]

In [69]:
#Random Tree Regression

from sklearn.ensemble import RandomForestClassifier

localhost:8888/nbconvert/html/Documents/IT Courses/Machine Learning/Refrence folders/Natural Langage Processing/Sentiment Analysis/ Sentim… 16/22


2/7/22, 5:06 PM Sentiment Analysis - Comparing Algorithms Accuracy
classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_st
classifier.fit(x_train, y_train)

# check for accuracy

from sklearn.metrics import confusion_matrix, accuracy_score

cm = confusion_matrix(y_test, y_predict)
print(cm)

accuracy_score(y_test, y_predict)
accuracy = classifier.score(x_test, y_test)

Random_Tree_Regression=("Random_Tree_Regression Accuracy: {:.2f}%".format(accuracy*100)


# put prediction and actual values side-by-side

pd.DataFrame(data={'predictions': y_predict, 'actual': y_test}).head()

[[14 0 1]

[ 0 15 0]

[ 0 0 9]]

Out[69]: predictions actual

0 2 2

0 0 0

0 1 1

0 0 0

0 0 0

In [70]:
print(Gaussian_Naive_Bayes)

print(DecisionTree)

print(Kernel_SVM)

print(Logistic_Regression)
print(Random_Tree_Regression)

Gaussian_Naive_Bayes Accuracy: 76.92%

DecisionTree Accuracy: 97.44%

Kernel_SVM Accuracy: 92.31%

Logistic_Regression Accuracy: 92.31%

Random_Tree_Regression Accuracy: 94.87%

In [71]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)

pca.fit_transform(X)

x_pca = pca.transform(X)

In [72]:
plt.figure(figsize=(9,6))

plt.scatter(x_pca[:,0],x_pca[:,1],c=y,cmap='viridis')

plt.xlabel('First Principal Component')

plt.ylabel('Second Principal Component')

Text(0, 0.5, 'Second Principal Component')


Out[72]:

localhost:8888/nbconvert/html/Documents/IT Courses/Machine Learning/Refrence folders/Natural Langage Processing/Sentiment Analysis/ Sentim… 17/22


2/7/22, 5:06 PM Sentiment Analysis - Comparing Algorithms Accuracy

In [45]:
from sklearn.metrics import confusion_matrix, accuracy_score

def plot_confusion_matrix(cm, classes,

normalize=False,

title='Confusion matrix',

cmap=plt.cm.Blues):

"""

This function prints and plots the confusion matrix.

Normalization can be applied by setting `normalize=True`.

"""

plt.imshow(cm, interpolation='nearest', cmap=cmap)


plt.title(title)

plt.colorbar()

tick_marks = np.arange(len(classes))

plt.xticks(tick_marks, classes, rotation=45)

plt.yticks(tick_marks, classes)

if normalize:

cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

print("Normalized confusion matrix")

else:

print('Confusion matrix, without normalization')

print(cm)

thresh = cm.max() / 2.

for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):

plt.text(j, i, cm[i, j],

horizontalalignment="center",

color="white" if cm[i, j] > thresh else "black")

plt.tight_layout()

plt.ylabel('True label')

plt.xlabel('Predicted label')

localhost:8888/nbconvert/html/Documents/IT Courses/Machine Learning/Refrence folders/Natural Langage Processing/Sentiment Analysis/ Sentim… 18/22


2/7/22, 5:06 PM Sentiment Analysis - Comparing Algorithms Accuracy
# Compute confusion matrix

cnf_matrix = confusion_matrix(y_test, y_pred)

In [57]:
import itertools

plt.figure(figsize=(7,5))

plot_confusion_matrix(cnf_matrix, classes=['1','2','3'],title='Confusion matrix, withou


accuracy_score(y_test, y_pred)

accuracy = classifier.score(x_test, y_test)

print()

print("Accuracy: {:.2f}%".format(accuracy*100))

Confusion matrix, without normalization

[[16 0 2]

[ 0 0 17]

[ 3 0 1]]

Accuracy: 89.74%

In [47]:
plt.figure(figsize=(20,7))

sns.heatmap(data.corr(), annot = True)

<AxesSubplot:>
Out[47]:

localhost:8888/nbconvert/html/Documents/IT Courses/Machine Learning/Refrence folders/Natural Langage Processing/Sentiment Analysis/ Sentim… 19/22


2/7/22, 5:06 PM Sentiment Analysis - Comparing Algorithms Accuracy

In [48]:
print(classification_report(y_test,y_pred))

precision recall f1-score support

0 0.84 0.89 0.86 18

1 0.00 0.00 0.00 17

2 0.05 0.25 0.08 4

accuracy 0.44 39

macro avg 0.30 0.38 0.32 39

weighted avg 0.39 0.44 0.41 39

In [49]:
data.columns

Index(['user_location', 'user_description', 'user_verified', 'date', 'text',

Out[49]:
'favourites-count', 'retweet-count', 'source', 'Subjectivity',

'Analysis'],

dtype='object')

In [50]:
from sklearn.metrics import mean_squared_error,r2_score

from math import sqrt

In [51]:
classifier = LogisticRegression(random_state = 0)

classifier.fit(x_train, y_train)

coeff = list(classifier.coef_[0])

labels = list(data[['user_location', 'user_description', 'user_verified', 'date', 'text


'favourites-count', 'retweet-count', 'source',

'Analysis']])

features = pd.DataFrame()

features['Features'] = labels

features['importance'] = coeff

features.sort_values(by=['importance'], ascending=True, inplace=True)

features['positive'] = features['importance'] > 0

features.set_index('Features', inplace=True)

localhost:8888/nbconvert/html/Documents/IT Courses/Machine Learning/Refrence folders/Natural Langage Processing/Sentiment Analysis/ Sentim… 20/22


2/7/22, 5:06 PM Sentiment Analysis - Comparing Algorithms Accuracy
features.importance.plot(kind='barh', figsize=(11, 6),color = features.positive.map({Tr
plt.xlabel('Importance')

Text(0.5, 0, 'Importance')
Out[51]:

In [56]:
import statsmodels.formula.api as smf

model = smf.ols("Analysis ~ user_location+user_description+text+source", data = data).f


print(model.summary())

OLS Regression Results

==============================================================================

Dep. Variable: Analysis R-squared: 0.040

Model: OLS Adj. R-squared: 0.030

Method: Least Squares F-statistic: 3.963

Date: Mon, 07 Feb 2022 Prob (F-statistic): 0.00366

Time: 16:26:12 Log-Likelihood: -405.00

No. Observations: 383 AIC: 820.0

Df Residuals: 378 BIC: 839.7

Df Model: 4

Covariance Type: nonrobust

====================================================================================

coef std err t P>|t| [0.025 0.975]

------------------------------------------------------------------------------------

Intercept 0.5010 0.161 3.110 0.002 0.184 0.818

user_location 0.0006 0.001 0.854 0.393 -0.001 0.002

user_description -0.0003 0.000 -0.748 0.455 -0.001 0.000

text 0.0053 0.001 3.796 0.000 0.003 0.008

source -0.0322 0.028 -1.146 0.253 -0.088 0.023

==============================================================================

Omnibus: 35.208 Durbin-Watson: 1.964

Prob(Omnibus): 0.000 Jarque-Bera (JB): 31.702

Skew: 0.631 Prob(JB): 1.31e-07

Kurtosis: 2.374 Cond. No. 780.

==============================================================================

Notes:

[1] Standard Errors assume that the covariance matrix of the errors is correctly specifi
ed.

localhost:8888/nbconvert/html/Documents/IT Courses/Machine Learning/Refrence folders/Natural Langage Processing/Sentiment Analysis/ Sentim… 21/22


2/7/22, 5:06 PM Sentiment Analysis - Comparing Algorithms Accuracy

In [ ]:

localhost:8888/nbconvert/html/Documents/IT Courses/Machine Learning/Refrence folders/Natural Langage Processing/Sentiment Analysis/ Sentim… 22/22

You might also like