Sentiment Analysis Using NLP
Import Data
In [ ]: 1 import pandas as pd
2 import matplotlib.pyplot as plt
Mounted at /content/drive
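The cells that mounted Google Drive and read the dataset did not survive the export (only their "Mounted at /content/drive" output above did), and several libraries used by later cells are imported in cells that are likewise missing. A minimal reconstruction of those steps; the CSV path and filename are assumptions:

import re

import nltk
import pandas as pd
import seaborn as sns
import spacy
from gensim import corpora
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Mounting Drive in Colab is what printed "Mounted at /content/drive":
# from google.colab import drive
# drive.mount('/content/drive')

# Hypothetical file path and name.
data = pd.read_csv('/content/drive/MyDrive/fake_or_factual.csv')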
In [11]: 1 data.head()
Out[11]:
                                               title                                               text                date fake_or_factual
0  HOLLYWEIRD LIB SUSAN SARANDON Compares Muslim ...  There are two small problems with your analogy...        Dec 30, 2015       Fake News
1   Elijah Cummings Called Trump Out To His Face ...  Buried in Trump s bonkers interview with New Y...       April 6, 2017       Fake News
2   Hillary Clinton Says Half Her Cabinet Will Be...  Women make up over 50 percent of this country,...      April 26, 2016       Fake News
3  Russian bombing of U.S.-backed forces being di...  WASHINGTON (Reuters) - U.S. Defense Secretary ...  September 18, 2017    Factual News
4  Britain says window to restore Northern Irelan...  BELFAST (Reuters) - Northern Ireland s politic...   September 4, 2017    Factual News
In [12]: 1 data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 198 entries, 0 to 197
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 title 198 non-null object
1 text 198 non-null object
2 date 198 non-null object
3 fake_or_factual 198 non-null object
dtypes: object(4)
memory usage: 6.3+ KB
POS Tagging
In [15]: 1 nlp = spacy.load('en_core_web_sm')
In [18]: 1 # create function to extract tags for each document in our data
2 def extract_token_tags(doc:spacy.tokens.doc.Doc):
3 return [(i.text, i.ent_type_, i.pos_) for i in doc]
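The cells that run the spaCy pipeline over each article and assemble the tagged DataFrames are missing from the export, although In [20] below relies on fake_tagsdf. A minimal sketch of that step, assuming the fake/factual split is taken from the fake_or_factual column:

# Hypothetical reconstruction: tag every document, then stack the
# (token, ner, pos) triples into one DataFrame per news type.
fake_news = data[data['fake_or_factual'] == 'Fake News']['text']
fact_news = data[data['fake_or_factual'] == 'Factual News']['text']

fake_tags = [extract_token_tags(doc) for doc in nlp.pipe(fake_news)]
fake_tagsdf = pd.DataFrame([t for doc in fake_tags for t in doc],
                           columns=['token', 'ner_tag', 'pos_tag'])

fact_tags = [extract_token_tags(doc) for doc in nlp.pipe(fact_news)]
fact_tagsdf = pd.DataFrame([t for doc in fact_tags for t in doc],
                           columns=['token', 'ner_tag', 'pos_tag'])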
In [20]: 1 fake_tagsdf.head()
Out[20]:
      token   ner_tag pos_tag
0     There              PRON
1       are              VERB
2       two  CARDINAL     NUM
3     small               ADJ
4  problems              NOUN
The most frequent token/tag pairs in the fake-news articles are dominated by punctuation and whitespace:

       token pos_tag  counts
28         ,   PUNCT    1908
39         .   PUNCT    1531
0              SPACE     795
The factual articles show the same pattern:

       token pos_tag  counts
15         ,   PUNCT    1698
22         .   PUNCT    1381
In [23]: 1 pos_counts_fake.groupby(['pos_tag'])['token'].count().sort_values(ascending=False).head(10)

Out[23]: pos_tag
NOUN 2597
VERB 1814
PROPN 1657
ADJ 876
ADV 412
NUM 221
PRON 99
ADP 88
AUX 58
SCONJ 54
Name: token, dtype: int64
In [24]: 1 pos_counts_fact.groupby(['pos_tag'])['token'].count().sort_values(ascending=False).head(10)
Out[24]: pos_tag
NOUN 2182
VERB 1535
PROPN 1387
ADJ 753
ADV 271
NUM 203
PRON 81
ADP 70
AUX 44
SCONJ 39
Name: token, dtype: int64
A stray token also surfaces: the source text has its apostrophes stripped (e.g. "Trump s"), so fragments like "t" are counted as standalone nouns:

7344  t  NOUN  40
Named Entities
In [27]: 1 # top entities in fake news
2 top_entities_fake = fake_tagsdf[fake_tagsdf['ner_tag'] != ""] \
3 .groupby(['token','ner_tag']).size().reset_index(name='counts') \
4 .sort_values(by='counts', ascending=False)
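The counterpart cell for factual news, whose result top_entities_fact is plotted in In [31], is missing, as is the definition of ner_palette used by both bar charts. A plausible reconstruction; the palette construction is an assumption:

# Hypothetical reconstruction, mirroring In [27] for the factual articles.
top_entities_fact = fact_tagsdf[fact_tagsdf['ner_tag'] != ""] \
    .groupby(['token','ner_tag']).size().reset_index(name='counts') \
    .sort_values(by='counts', ascending=False)

# Fixed colour per entity type so both charts share a legend scheme (assumed).
ner_tags = sorted(t for t in fake_tagsdf['ner_tag'].unique() if t)
ner_palette = dict(zip(ner_tags, sns.color_palette("hls", len(ner_tags))))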
In [30]: 1 sns.barplot(
2 x = 'counts',
3 y = 'token',
4 hue = 'ner_tag',
5 palette = ner_palette,
6 data = top_entities_fake[0:10],
7 orient = 'h',
8 dodge=False
9 ) \
10 .set(title='Most Common Entities in Fake News')
In [31]: 1 sns.barplot(
2 x = 'counts',
3 y = 'token',
4 hue = 'ner_tag',
5 palette = ner_palette,
6 data = top_entities_fact[0:10],
7 orient = 'h',
8 dodge=False
9 ) \
10 .set(title='Most Common Entities in Factual News')
Text Pre-processing
In [32]: 1 # a lot of the factual news has a location tag at the beginning of the article, let's use regex to remove it
2 data['text_clean'] = data.apply(lambda x: re.sub(r"^[^-]*-\s*", "", x['text']), axis=1)
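The pattern deletes everything up to and including the first hyphen, plus any whitespace after it, which strips datelines such as "WASHINGTON (Reuters) - ". Articles containing no hyphen are left untouched (the pattern requires one), though a hyphen appearing early in other text would also trigger the removal. For example:

re.sub(r"^[^-]*-\s*", "", "WASHINGTON (Reuters) - U.S. Defense Secretary Jim Mattis said ...")
# -> 'U.S. Defense Secretary Jim Mattis said ...'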
In [33]: 1 # lowercase
2 data['text_clean'] = data['text_clean'].str.lower()
The NLTK English stopword list that gets removed next:

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll",
"you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's",
'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs',
'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am',
'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does',
'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while',
'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during',
'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over',
'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all',
'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only',
'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't",
'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't",
'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't",
'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn',
"needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won',
"won't", 'wouldn', "wouldn't"]
In [39]: 1 # tokenize
2 nltk.download('punkt')
3 data['text_clean'] = data.apply(lambda x: word_tokenize(x['text_clean']), axis=1)
In [41]: 1 # lemmatize
2 nltk.download('wordnet')
3 lemmatizer = WordNetLemmatizer()
4 data["text_clean"] = data["text_clean"].apply(lambda tokens: [lemmatizer.lemmatize(token) for token
In [42]: 1 data.head()
Out[42]:
                                               title                                               text                date fake_or_factual                                         text_clean
0  HOLLYWEIRD LIB SUSAN SARANDON Compares Muslim ...  There are two small problems with your analogy...        Dec 30, 2015       Fake News  [yearold, oscarwinning, actress, described, me...
1   Elijah Cummings Called Trump Out To His Face ...  Buried in Trump s bonkers interview with New Y...       April 6, 2017       Fake News  [buried, trump, bonkers, interview, new, york,...
2   Hillary Clinton Says Half Her Cabinet Will Be...  Women make up over 50 percent of this country,...      April 26, 2016       Fake News  [woman, make, 50, percent, country, grossly, u...
3  Russian bombing of U.S.-backed forces being di...  WASHINGTON (Reuters) - U.S. Defense Secretary ...  September 18, 2017    Factual News  [u, defense, secretary, jim, mattis, said, mon...
4  Britain says window to restore Northern Irelan...  BELFAST (Reuters) - Northern Ireland s politic...   September 4, 2017    Factual News  [northern, ireland, political, party, rapidly,...
Most frequent unigrams in the cleaned text:

(said,)          560
(trump,)         520
(u,)             255
(state,)         250
(president,)     226
(would,)         210
(one,)           141
(year,)          128
(republican,)    128
(also,)          124
dtype: int64
Most frequent bigrams in the cleaned text:

(donald, trump)        92
(united, state)        80
(white, house)         72
(president, donald)    42
(hillary, clinton)     31
(new, york)            31
(image, via)           29
(supreme, court)       29
(official, said)       26
(food, stamp)          24
dtype: int64
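The cells that compute these frequency tables are missing, but the tuple-valued index is characteristic of nltk.ngrams counted in a pandas Series. A sketch under that assumption:

from nltk.util import ngrams

# Hypothetical reconstruction: pool all cleaned tokens, count n-grams.
tokens = [token for doc in data['text_clean'] for token in doc]
unigrams = pd.Series(list(ngrams(tokens, 1))).value_counts()
bigrams = pd.Series(list(ngrams(tokens, 2))).value_counts()
print(unigrams.head(10))
print(bigrams.head(10))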
Sentiment Analysis
In [46]: 1 # use vader so we also get a neutral sentiment count
2 nltk.download('vader_lexicon')
3 vader_sentiment = SentimentIntensityAnalyzer()
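The cells that score each article and derive vader_sentiment_label (used in In [49] and In [50]) are missing. A minimal sketch; the +/-0.05 thresholds are the conventional VADER cut-offs rather than values the notebook confirms, and default_plot_colour in In [49] is assumed to come from another lost cell:

# Hypothetical reconstruction: compound score per article, then a 3-way label.
data['vader_sentiment_score'] = data['text'].apply(
    lambda x: vader_sentiment.polarity_scores(x)['compound'])

data['vader_sentiment_label'] = pd.cut(
    data['vader_sentiment_score'],
    bins=[-1.0, -0.05, 0.05, 1.0],
    labels=['negative', 'neutral', 'positive'])

default_plot_colour = '#4c72b0'  # assumed; any single colour works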
In [49]: 1 data['vader_sentiment_label'].value_counts().plot.bar(color=default_plot_colour)
In [50]: 1 sns.countplot(
2 x = 'fake_or_factual',
3 hue = 'vader_sentiment_label',
4 palette = sns.color_palette("hls"),
5 data = data
6 ) \
7 .set(title='Sentiment by News Type')
LDA
In [51]: 1 # fake news data vectorization
2 fake_news_text = data[data['fake_or_factual'] == "Fake News"]['text_clean'].reset_index(drop=True)
3 dictionary_fake = corpora.Dictionary(fake_news_text)
4 doc_term_fake = [dictionary_fake.doc2bow(text) for text in fake_news_text]
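The cell that fits the topic model and produces Out[53] is missing. A sketch of what it plausibly contains: gensim's LdaModel with five topics, matching the five topics printed below; random_state and passes are assumptions:

from gensim.models import LdaModel

# Hypothetical reconstruction: LDA over the fake-news bag-of-words corpus.
lda_fake = LdaModel(corpus=doc_term_fake,
                    id2word=dictionary_fake,
                    num_topics=5,
                    random_state=42,  # assumed
                    passes=10)        # assumed
lda_fake.show_topics(num_topics=5, num_words=10)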
Out[53]: [(0,
  '0.009*"trump" + 0.004*"food" + 0.004*"said" + 0.003*"u" + 0.003*"stamp" + 0.003*"state" + 0.003*"time" + 0.003*"million" + 0.003*"president" + 0.003*"woman"'),
 (1,
  '0.011*"trump" + 0.007*"said" + 0.005*"president" + 0.004*"clinton" + 0.004*"one" + 0.004*"time" + 0.003*"obama" + 0.003*"state" + 0.003*"would" + 0.003*"u"'),
 (2,
  '0.015*"trump" + 0.005*"would" + 0.005*"president" + 0.003*"clinton" + 0.003*"student" + 0.003*"u" + 0.003*"woman" + 0.003*"people" + 0.003*"one" + 0.003*"year"'),
 (3,
  '0.010*"trump" + 0.006*"said" + 0.006*"state" + 0.005*"republican" + 0.005*"president" + 0.004*"clinton" + 0.004*"time" + 0.004*"would" + 0.003*"woman" + 0.003*"people"'),
 (4,
  '0.008*"trump" + 0.005*"clinton" + 0.004*"state" + 0.004*"one" + 0.004*"u" + 0.004*"would" + 0.003*"said" + 0.003*"mccain" + 0.003*"people" + 0.003*"official"')]
In [54]: 1 # our topics contain a lot of very similar words, let's try using latent semantic analysis with tf-idf
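The tf-idf and LSI cells themselves are also missing; Out[58] below shows three LSI topics. A sketch assuming gensim's TfidfModel and LsiModel with num_topics=3:

from gensim.models import LsiModel, TfidfModel

# Hypothetical reconstruction: reweight the bag-of-words corpus by tf-idf,
# then fit latent semantic analysis (a truncated SVD) on top of it.
tfidf_fake = TfidfModel(doc_term_fake)
corpus_tfidf_fake = [tfidf_fake[doc] for doc in doc_term_fake]

lsi_fake = LsiModel(corpus=corpus_tfidf_fake,
                    id2word=dictionary_fake,
                    num_topics=3)  # matches the three topics in Out[58]
lsi_fake.show_topics(num_topics=3, num_words=10)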
Out[58]: [(0,
  '0.218*"trump" + 0.135*"clinton" + 0.094*"woman" + 0.087*"president" + 0.086*"republican" + 0.085*"obama" + 0.084*"party" + 0.083*"school" + 0.081*"said" + 0.079*"time"'),
 (1,
  '-0.299*"boiler" + -0.253*"room" + -0.250*"acr" + -0.186*"jay" + -0.185*"animal" + -0.176*"episode" + -0.147*"analysis" + -0.122*"dyer" + -0.119*"corner" + -0.119*"spore"'),
 (2,
  '-0.218*"school" + 0.194*"clinton" + 0.165*"conference" + -0.151*"county" + -0.136*"student" + 0.120*"press" + 0.116*"trump" + 0.112*"hillary" + -0.101*"love" + 0.096*"email"')]
In [59]: 1 data.head()
Out[59]:
                                               title                                               text            date fake_or_factual                                         text_clean  vader_sentiment_score vader_sentiment_label
0  HOLLYWEIRD LIB SUSAN SARANDON Compares Muslim ...  There are two small problems with your analogy...    Dec 30, 2015       Fake News  [yearold, oscarwinning, actress, described, me...                -0.3660              negative
1   Elijah Cummings Called Trump Out To His Face ...  Buried in Trump s bonkers interview with New Y...   April 6, 2017       Fake News  [buried, trump, bonkers, interview, new, york,...                -0.8197              negative
2   Hillary Clinton Says Half Her Cabinet Will Be...  Women make up over 50 percent of this country,...  April 26, 2016       Fake News  [woman, make, 50, percent, country, grossly, u...                 0.9779              positive
Out[65]: 0.8833333333333333

              precision    recall  f1-score   support

    accuracy                           0.88        60
   macro avg       0.89      0.88      0.88        60
weighted avg       0.89      0.88      0.88        60
Out[69]: 0.8666666666666667

              precision    recall  f1-score   support

    accuracy                           0.87        60
   macro avg       0.87      0.87      0.87        60
weighted avg       0.87      0.87      0.87        60
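The modeling cells behind Out[65] and Out[69] did not survive the export; only the two test-set accuracies and the summary rows of their classification reports remain. A minimal sketch of one pipeline that would produce output of this shape; the vectorizer, the logistic-regression classifier, and the split parameters are all assumptions (though test_size=0.3 of 198 rows does yield the 60 test documents seen in the support column):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Hypothetical reconstruction: bag-of-words features over the cleaned tokens.
X = CountVectorizer().fit_transform(data['text_clean'].apply(' '.join))
y = data['fake_or_factual']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy_score(y_test, y_pred)              # cf. Out[65]
print(classification_report(y_test, y_pred))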