Sentiment Analysis Using NLP

The document discusses using natural language processing techniques like part-of-speech tagging and sentiment analysis to analyze fake and factual news articles. It imports data on fake and factual news articles, explores the data, and imports necessary NLP packages. The goal is to develop a system that can identify fake news by analyzing the text of news articles.



Sentiment Analysis Use Case with the Implementation of Natural Language Processing (NLP)

Assume you work for a social media company. The company is concerned about the growing amount of false information circulating on its platform, and you have been tasked with working out how to spot fake news and building a system to do so. Together, let's explore and clean the data before attempting to classify fabricated versus real news reports. We'll also discuss how we might present our results to stakeholders and produce some charts of our outputs.

Import Data
In [ ]: 1 import pandas as pd
2 import matplotlib.pyplot as plt

In [ ]: 1 # set plot options


2 plt.rcParams['figure.figsize'] = (12, 8)
3 default_plot_colour = "#00bfbf"

In [ ]: 1 from google.colab import drive


2 drive.mount('/content/drive')

Mounted at /content/drive

In [10]: 1 data = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/NLP/fake_news_data.csv")
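If you are running this outside Colab, the Drive mount isn't needed; a minimal sketch of loading the same file from a local path (the path below is a placeholder, not the original location) and checking what we have:

data = pd.read_csv("fake_news_data.csv")        # placeholder local path; expects columns title, text, date, fake_or_factual
print(data.shape)
print(data['fake_or_factual'].value_counts())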


In [11]: 1 data.head()

Out[11]:
                                                title                                                text                date  fake_or_factual
0  HOLLYWEIRD LIB SUSAN SARANDON Compares Muslim ...  There are two small problems with your analogy...        Dec 30, 2015        Fake News
1   Elijah Cummings Called Trump Out To His Face ...  Buried in Trump s bonkers interview with New Y...       April 6, 2017        Fake News
2    Hillary Clinton Says Half Her Cabinet Will Be...  Women make up over 50 percent of this country,...      April 26, 2016        Fake News
3  Russian bombing of U.S.-backed forces being di...  WASHINGTON (Reuters) - U.S. Defense Secretary ...  September 18, 2017     Factual News
4   Britain says window to restore Northern Irelan...  BELFAST (Reuters) - Northern Ireland s politic...   September 4, 2017     Factual News

In [12]: 1 data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 198 entries, 0 to 197
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 title 198 non-null object
1 text 198 non-null object
2 date 198 non-null object
3 fake_or_factual 198 non-null object
dtypes: object(4)
memory usage: 6.3+ KB


In [13]: 1 # plot number of fake and factual articles


2 data['fake_or_factual'].value_counts().plot(kind='bar', color=default_plot_colour)
3 plt.title('Count of Article Classification')
4 plt.ylabel('# of Articles')
5 plt.xlabel('Classification')

Out[13]: Text(0.5, 0, 'Classification')



Import packages required for processing and analysis


In [14]:  1 !pip install vaderSentiment
          2 import seaborn as sns
          3 import spacy
          4 from spacy import displacy
          5 from spacy import tokenizer
          6 import re
          7 import nltk
          8 from nltk.tokenize import word_tokenize
          9 from nltk.stem import PorterStemmer, WordNetLemmatizer
         10 from nltk.corpus import stopwords
         11 from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
         12 import gensim
         13 import gensim.corpora as corpora
         14 from gensim.models.coherencemodel import CoherenceModel
         15 from gensim.models import LsiModel, TfidfModel
         16 from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
         17 from sklearn.model_selection import train_test_split
         18 from sklearn.linear_model import LogisticRegression, SGDClassifier
         19 from sklearn.metrics import accuracy_score, classification_report

Requirement already satisfied: vaderSentiment in /usr/local/lib/python3.10/dist-packages (3.3.2)
Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from vaderSentiment) (2.31.0)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->vaderSentiment) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->vaderSentiment) (3.4)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->vaderSentiment) (2.0.7)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->vaderSentiment) (2023.7.22)

POS Tagging
In [15]: 1 nlp = spacy.load('en_core_web_sm')


In [16]: 1 # split data by fake and factual news


2 fake_news = data[data['fake_or_factual'] == "Fake News"]
3 fact_news = data[data['fake_or_factual'] == "Factual News"]

In [17]: 1 # create spaCy documents - use pipe for the dataframe


2 fake_spaceydocs = list(nlp.pipe(fake_news['text']))
3 fact_spaceydocs = list(nlp.pipe(fact_news['text']))

In [18]:  1 # create function to extract tags for each document in our data
          2 def extract_token_tags(doc:spacy.tokens.doc.Doc):
          3     return [(i.text, i.ent_type_, i.pos_) for i in doc]
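To see what extract_token_tags produces, here is a quick sketch on a single made-up sentence (the exact tags depend on the en_core_web_sm model version):

example_doc = nlp("Hillary Clinton visited Ohio on Tuesday, Reuters reported.")
extract_token_tags(example_doc)
# each tuple is (token text, entity label or '' if none, coarse POS tag), roughly:
# [('Hillary', 'PERSON', 'PROPN'), ('Clinton', 'PERSON', 'PROPN'), ('visited', '', 'VERB'), ('Ohio', 'GPE', 'PROPN'), ...]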

In [19]:  1 # tag fake dataset
          2 fake_tagsdf = []
          3 columns = ["token", "ner_tag", "pos_tag"]
          4
          5 for ix, doc in enumerate(fake_spaceydocs):
          6     tags = extract_token_tags(doc)
          7     tags = pd.DataFrame(tags)
          8     tags.columns = columns
          9     fake_tagsdf.append(tags)
         10
         11 fake_tagsdf = pd.concat(fake_tagsdf)
         12
         13 # tag factual dataset
         14 fact_tagsdf = []
         15
         16 for ix, doc in enumerate(fact_spaceydocs):
         17     tags = extract_token_tags(doc)
         18     tags = pd.DataFrame(tags)
         19     tags.columns = columns
         20     fact_tagsdf.append(tags)
         21
         22 fact_tagsdf = pd.concat(fact_tagsdf)
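The two loops above differ only in their input, so they could be collapsed into a single helper; a sketch (tag_documents is a new name, not part of the notebook):

def tag_documents(docs):
    # build one token/NER/POS dataframe from a list of spaCy docs
    frames = []
    for doc in docs:
        frames.append(pd.DataFrame(extract_token_tags(doc), columns=["token", "ner_tag", "pos_tag"]))
    return pd.concat(frames)

# equivalent to the two loops above:
# fake_tagsdf = tag_documents(fake_spaceydocs)
# fact_tagsdf = tag_documents(fact_spaceydocs)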


In [20]: 1 fake_tagsdf.head()

Out[20]:    token     ner_tag   pos_tag
         0  There               PRON
         1  are                 VERB
         2  two       CARDINAL  NUM
         3  small               ADJ
         4  problems            NOUN


In [21]: 1 # token frequency count (fake)


2 pos_counts_fake = fake_tagsdf.groupby(['token','pos_tag']).size().reset_index(name='counts').sort_values(by='counts', ascending=False)
3 pos_counts_fake.head(10)

Out[21]: token pos_tag counts

28 , PUNCT 1908

7446 the DET 1834

39 . PUNCT 1531

5759 of ADP 922

2661 and CCONJ 875

2446 a DET 804

0 SPACE 795

7523 to PART 767

4915 in ADP 667

5094 is AUX 419


In [22]: 1 # token frequency count (fact)


2 pos_counts_fact = fact_tagsdf.groupby(['token','pos_tag']).size().reset_index(name='counts').sort_values(by='counts', ascending=False)
3 pos_counts_fact.head(10)

Out[22]: token pos_tag counts

6169 the DET 1903

15 , PUNCT 1698

22 . PUNCT 1381

4733 of ADP 884

1905 a DET 789

2100 and CCONJ 757

4015 in ADP 672

6230 to PART 660

4761 on ADP 482

5586 said VERB 452


In [23]: 1 # frequencies of pos tags


2 pos_counts_fake.groupby(['pos_tag'])['token'].count().sort_values(ascending=False).head(10)

Out[23]: pos_tag
NOUN 2597
VERB 1814
PROPN 1657
ADJ 876
ADV 412
NUM 221
PRON 99
ADP 88
AUX 58
SCONJ 54
Name: token, dtype: int64

In [24]: 1 pos_counts_fact.groupby(['pos_tag'])['token'].count().sort_values(ascending=False).head(10)

Out[24]: pos_tag
NOUN 2182
VERB 1535
PROPN 1387
ADJ 753
ADV 271
NUM 203
PRON 81
ADP 70
AUX 44
SCONJ 39
Name: token, dtype: int64


In [25]: 1 # dive into differences in nouns


2 pos_counts_fake[pos_counts_fake.pos_tag == "NOUN"][0:15]

Out[25]: token pos_tag counts

5969 people NOUN 77

7959 women NOUN 55

6204 president NOUN 53

7511 time NOUN 52

8011 year NOUN 44

3134 campaign NOUN 44

4577 government NOUN 41

5208 law NOUN 40

7344 t NOUN 40

8013 years NOUN 40

7157 state NOUN 39

4010 election NOUN 37

5474 media NOUN 36

3639 day NOUN 35

3534 country NOUN 33


In [26]: 1 pos_counts_fact[pos_counts_fact.pos_tag == "NOUN"][0:15]

Out[26]: token pos_tag counts

3748 government NOUN 71

6639 year NOUN 64

5927 state NOUN 58

2373 bill NOUN 55

1982 administration NOUN 51

3289 election NOUN 48

5084 president NOUN 47

4804 order NOUN 45

4937 people NOUN 45

2509 campaign NOUN 42

4271 law NOUN 42

6118 tax NOUN 39

5415 reporters NOUN 38

5930 statement NOUN 37

4941 percent NOUN 36

Named Entities
In [27]: 1 # top entities in fake news
2 top_entities_fake = fake_tagsdf[fake_tagsdf['ner_tag'] != ""] \
3 .groupby(['token','ner_tag']).size().reset_index(name='counts') \
4 .sort_values(by='counts', ascending=False)


In [28]: 1 # top entities in fact news


2 top_entities_fact = fact_tagsdf[fact_tagsdf['ner_tag'] != ""] \
3 .groupby(['token','ner_tag']).size().reset_index(name='counts') \
4 .sort_values(by='counts', ascending=False)

In [29]: 1 # create custom palette to ensure plots are consistent


2 ner_palette = {
3 'ORG': sns.color_palette("Set2").as_hex()[0],
4 'GPE': sns.color_palette("Set2").as_hex()[1],
5 'NORP': sns.color_palette("Set2").as_hex()[2],
6 'PERSON': sns.color_palette("Set2").as_hex()[3],
7 'DATE': sns.color_palette("Set2").as_hex()[4],
8 'CARDINAL': sns.color_palette("Set2").as_hex()[5],
9 'PERCENT': sns.color_palette("Set2").as_hex()[6]
10 }
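Since every entry draws from the same Set2 palette, the mapping could also be built in one pass; a sketch that produces the same dictionary:

set2_hex = sns.color_palette("Set2").as_hex()
entity_labels = ['ORG', 'GPE', 'NORP', 'PERSON', 'DATE', 'CARDINAL', 'PERCENT']
ner_palette = dict(zip(entity_labels, set2_hex))   # same label-to-colour mapping as above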


In [30]: 1 sns.barplot(
2 x = 'counts',
3 y = 'token',
4 hue = 'ner_tag',
5 palette = ner_palette,
6 data = top_entities_fake[0:10],
7 orient = 'h',
8 dodge=False
9 ) \
10 .set(title='Most Common Entities in Fake News')

Out[30]: [Text(0.5, 1.0, 'Most Common Entities in Fake News')]


In [31]: 1 sns.barplot(
2 x = 'counts',
3 y = 'token',
4 hue = 'ner_tag',
5 palette = ner_palette,
6 data = top_entities_fact[0:10],
7 orient = 'h',
8 dodge=False
9 ) \
10 .set(title='Most Common Entities in Factual News')

Out[31]: [Text(0.5, 1.0, 'Most Common Entities in Factual News')]
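displacy was imported earlier but hasn't been used yet; as a sketch, it can highlight the entities in a single article inline (article 0 is just an arbitrary example):

# renders inline HTML in a notebook environment
displacy.render(fake_spaceydocs[0], style="ent", jupyter=True)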


Text Pre-processing
In [32]: 1 # a lot of the factual news has a location tag at the beginning of the article, let's use regex to remove it
2 data['text_clean'] = data.apply(lambda x: re.sub(r"^[^-]*-\s*", "", x['text']), axis=1)
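A quick check of what that regex does on a Reuters-style opener (the example string is made up):

sample = "WASHINGTON (Reuters) - U.S. Defense Secretary Jim Mattis said on Monday ..."
re.sub(r"^[^-]*-\s*", "", sample)
# -> 'U.S. Defense Secretary Jim Mattis said on Monday ...'
# everything up to and including the first hyphen (the location/agency tag) is stripped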


In [33]: 1 # lowercase
2 data['text_clean'] = data['text_clean'].str.lower()

In [34]: 1 # remove punctuation


2 data['text_clean'] = data.apply(lambda x: re.sub(r"([^\w\s])", "", x['text_clean']), axis=1)

In [36]: 1 # stop words


2 nltk.download('stopwords')
3
4 en_stopwords = stopwords.words('english')
5 print(en_stopwords) # check this against our most frequent n-grams

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll",
 "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's",
 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs',
 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am',
 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does',
 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while',
 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during',
 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over',
 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how',
 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not',
 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don',
 "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren',
 "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn',
 "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't",
 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren',
 "weren't", 'won', "won't", 'wouldn', "wouldn't"]

[nltk_data] Downloading package stopwords to /root/nltk_data...


[nltk_data] Unzipping corpora/stopwords.zip.

In [37]: 1 data['text_clean'] = data['text_clean'].apply(lambda x: ' '.join([word for word in x.split() if word not in en_stopwords]))

In [39]: 1 # tokenize
2 nltk.download('punkt')
3 data['text_clean'] = data.apply(lambda x: word_tokenize(x['text_clean']), axis=1)

[nltk_data] Downloading package punkt to /root/nltk_data...


[nltk_data] Unzipping tokenizers/punkt.zip.


In [41]: 1 # lemmatize
2 nltk.download('wordnet')
3 lemmatizer = WordNetLemmatizer()
4 data["text_clean"] = data["text_clean"].apply(lambda tokens: [lemmatizer.lemmatize(token) for token

[nltk_data] Downloading package wordnet to /root/nltk_data...
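For a sense of what the lemmatizer changes, a small sketch on a few tokens (WordNetLemmatizer defaults to noun lemmas, so verb forms such as "said" are left untouched):

[lemmatizer.lemmatize(w) for w in ["women", "parties", "reporters", "said"]]
# -> ['woman', 'party', 'reporter', 'said']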

In [42]: 1 data.head()

Out[42]:
                                                title                                                text                date  fake_or_factual                                          text_clean
0  HOLLYWEIRD LIB SUSAN SARANDON Compares Muslim ...  There are two small problems with your analogy...        Dec 30, 2015        Fake News  [yearold, oscarwinning, actress, described, me...
1   Elijah Cummings Called Trump Out To His Face ...  Buried in Trump s bonkers interview with New Y...       April 6, 2017        Fake News  [buried, trump, bonkers, interview, new, york,...
2    Hillary Clinton Says Half Her Cabinet Will Be...  Women make up over 50 percent of this country,...      April 26, 2016        Fake News  [woman, make, 50, percent, country, grossly, u...
3  Russian bombing of U.S.-backed forces being di...  WASHINGTON (Reuters) - U.S. Defense Secretary ...  September 18, 2017     Factual News  [u, defense, secretary, jim, mattis, said, mon...
4   Britain says window to restore Northern Irelan...  BELFAST (Reuters) - Northern Ireland s politic...   September 4, 2017     Factual News  [northern, ireland, political, party, rapidly,...

In [43]: 1 # most common unigrams after preprocessing


2 tokens_clean = sum(data['text_clean'], [])
3 unigrams = (pd.Series(nltk.ngrams(tokens_clean, 1)).value_counts())
4 print(unigrams[:10])

(said,) 560
(trump,) 520
(u,) 255
(state,) 250
(president,) 226
(would,) 210
(one,) 141
(year,) 128
(republican,) 128
(also,) 124
dtype: int64


In [44]: 1 sns.barplot(x = unigrams.values[:10],


2 y = unigrams.index[:10],
3 orient = 'h',
4 palette=[default_plot_colour])\
5 .set(title='Most Common Unigrams After Preprocessing')

Out[44]: [Text(0.5, 1.0, 'Most Common Unigrams After Preprocessing')]


In [45]: 1 # most common bigrams after preprocessing


2 bigrams = (pd.Series(nltk.ngrams(tokens_clean, 2)).value_counts())
3 print(bigrams[:10])

(donald, trump) 92
(united, state) 80
(white, house) 72
(president, donald) 42
(hillary, clinton) 31
(new, york) 31
(image, via) 29
(supreme, court) 29
(official, said) 26
(food, stamp) 24
dtype: int64

Sentiment Analysis
In [46]: 1 # use vader so we also get a neutral sentiment count
2 vader_sentiment = SentimentIntensityAnalyzer()

In [47]: 1 data['vader_sentiment_score'] = data['text'].apply(lambda review: vader_sentiment.polarity_scores(review)['compound'])
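polarity_scores returns negative, neutral, positive, and compound values; only the compound score (a normalised value between -1 and 1) is kept here. A quick sketch on a made-up sentence:

vader_sentiment.polarity_scores("The senator gave a fantastic, uplifting speech.")
# -> a dict with keys 'neg', 'neu', 'pos' and 'compound'; the exact numbers depend on the VADER lexicon version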

In [48]: 1 # create labels


2 bins = [-1, -0.1, 0.1, 1]
3 names = ['negative', 'neutral', 'positive']
4
5 data['vader_sentiment_label'] = pd.cut(data['vader_sentiment_score'], bins, labels=names)


In [49]: 1 data['vader_sentiment_label'].value_counts().plot.bar(color=default_plot_colour)

Out[49]: <Axes: >


In [50]: 1 sns.countplot(
2 x = 'fake_or_factual',
3 hue = 'vader_sentiment_label',
4 palette = sns.color_palette("hls"),
5 data = data
6 ) \
7 .set(title='Sentiment by News Type')

Out[50]: [Text(0.5, 1.0, 'Sentiment by News Type')]


LDA
In [51]: 1 # fake news data vectorization
2 fake_news_text = data[data['fake_or_factual'] == "Fake News"]['text_clean'].reset_index(drop=True)
3 dictionary_fake = corpora.Dictionary(fake_news_text)
4 doc_term_fake = [dictionary_fake.doc2bow(text) for text in fake_news_text]
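doc2bow maps each tokenised document to a list of (token id, count) pairs against the dictionary; a sketch on a toy document (the words are made up):

toy_dictionary = corpora.Dictionary([["trump", "said", "said", "election"]])
print(toy_dictionary.token2id)                                        # token -> integer id mapping
print(toy_dictionary.doc2bow(["trump", "said", "said", "election"]))  # e.g. the id for "said" appears with a count of 2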


In [52]:  1 # generate coherence scores to determine an optimum number of topics
          2 coherence_values = []
          3 model_list = []
          4
          5 min_topics = 2
          6 max_topics = 11
          7
          8 for num_topics_i in range(min_topics, max_topics+1):
          9     model = gensim.models.LdaModel(doc_term_fake, num_topics=num_topics_i, id2word=dictionary_fake)
         10     model_list.append(model)
         11     coherence_model = CoherenceModel(model=model, texts=fake_news_text, dictionary=dictionary_fake, coherence='c_v')
         12     coherence_values.append(coherence_model.get_coherence())
         13
         14 plt.plot(range(min_topics, max_topics+1), coherence_values)
         15 plt.xlabel("Number of Topics")
         16 plt.ylabel("Coherence score")
         17 plt.legend(["coherence_values"], loc='best')
         18 plt.show()

WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
(the same warning is emitted once for each model fitted in the loop)


In [53]: 1 # create lda model


2 num_topics_fake = 5
3
4 lda_model_fake = gensim.models.LdaModel(corpus=doc_term_fake,
5 id2word=dictionary_fake,
6 num_topics=num_topics_fake)
7
8 lda_model_fake.print_topics(num_topics=num_topics_fake, num_words=10)

WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy

Out[53]: [(0,
  '0.009*"trump" + 0.004*"food" + 0.004*"said" + 0.003*"u" + 0.003*"stamp" + 0.003*"state" + 0.003*"time" + 0.003*"million" + 0.003*"president" + 0.003*"woman"'),
 (1,
  '0.011*"trump" + 0.007*"said" + 0.005*"president" + 0.004*"clinton" + 0.004*"one" + 0.004*"time" + 0.003*"obama" + 0.003*"state" + 0.003*"would" + 0.003*"u"'),
 (2,
  '0.015*"trump" + 0.005*"would" + 0.005*"president" + 0.003*"clinton" + 0.003*"student" + 0.003*"u" + 0.003*"woman" + 0.003*"people" + 0.003*"one" + 0.003*"year"'),
 (3,
  '0.010*"trump" + 0.006*"said" + 0.006*"state" + 0.005*"republican" + 0.005*"president" + 0.004*"clinton" + 0.004*"time" + 0.004*"would" + 0.003*"woman" + 0.003*"people"'),
 (4,
  '0.008*"trump" + 0.005*"clinton" + 0.004*"state" + 0.004*"one" + 0.004*"u" + 0.004*"would" + 0.003*"said" + 0.003*"mccain" + 0.003*"people" + 0.003*"official"')]

In [54]: 1 # our topics contain a lot of very similar words, let's try using latent semantic analysis with tf-idf

TF-IDF & LSA


In [55]:  1 def tfidf_corpus(doc_term_matrix):
          2     # create a corpus using tf-idf vectorization
          3     tfidf = TfidfModel(corpus=doc_term_matrix, normalize=True)
          4     corpus_tfidf = tfidf[doc_term_matrix]
          5     return corpus_tfidf


In [56]:  1 def get_coherence_scores(corpus, dictionary, text, min_topics, max_topics):
          2     # generate coherence scores to determine an optimum number of topics
          3     coherence_values = []
          4     model_list = []
          5     for num_topics_i in range(min_topics, max_topics+1):
          6         model = LsiModel(corpus, num_topics=num_topics_i, id2word=dictionary)
          7         model_list.append(model)
          8         coherence_model = CoherenceModel(model=model, texts=text, dictionary=dictionary, coherence='c_v')
          9         coherence_values.append(coherence_model.get_coherence())
         10     # plot results
         11     plt.plot(range(min_topics, max_topics+1), coherence_values)
         12     plt.xlabel("Number of Topics")
         13     plt.ylabel("Coherence score")
         14     plt.legend(["coherence_values"], loc='best')
         15     plt.show()


In [57]: 1 # create tfidf representation


2 corpus_tfidf_fake = tfidf_corpus(doc_term_fake)
3 # coherence scores for fake news data
4 get_coherence_scores(corpus_tfidf_fake, dictionary_fake, fake_news_text, min_topics=2, max_topics=11)


In [58]: 1 # model for fake news data


2 lsa_fake = LsiModel(corpus_tfidf_fake, id2word=dictionary_fake, num_topics=3)
3 lsa_fake.print_topics()

Out[58]: [(0,
  '0.218*"trump" + 0.135*"clinton" + 0.094*"woman" + 0.087*"president" + 0.086*"republican" + 0.085*"obama" + 0.084*"party" + 0.083*"school" + 0.081*"said" + 0.079*"time"'),
 (1,
  '-0.299*"boiler" + -0.253*"room" + -0.250*"acr" + -0.186*"jay" + -0.185*"animal" + -0.176*"episode" + -0.147*"analysis" + -0.122*"dyer" + -0.119*"corner" + -0.119*"spore"'),
 (2,
  '-0.218*"school" + 0.194*"clinton" + 0.165*"conference" + -0.151*"county" + -0.136*"student" + 0.120*"press" + 0.116*"trump" + 0.112*"hillary" + -0.101*"love" + 0.096*"email"')]

Predict fake or factual news


In [59]: 1 data.head()

Out[59]:
                                                title                                                text                date  fake_or_factual                                          text_clean  vader_sentiment_score vader_sentiment_label
0  HOLLYWEIRD LIB SUSAN SARANDON Compares Muslim ...  There are two small problems with your analogy...        Dec 30, 2015        Fake News  [yearold, oscarwinning, actress, described, me...                -0.3660              negative
1   Elijah Cummings Called Trump Out To His Face ...  Buried in Trump s bonkers interview with New Y...       April 6, 2017        Fake News  [buried, trump, bonkers, interview, new, york,...                -0.8197              negative
2    Hillary Clinton Says Half Her Cabinet Will Be...  Women make up over 50 percent of this country,...      April 26, 2016        Fake News  [woman, make, 50, percent, country, grossly, u...                 0.9779              positive
3  Russian bombing of U.S.-backed forces being di...  WASHINGTON (Reuters) - U.S. Defense Secretary ...  September 18, 2017     Factual News  [u, defense, secretary, jim, mattis, said, mon...                -0.3400              negative
4   Britain says window to restore Northern Irelan...  BELFAST (Reuters) - Northern Ireland s politic...   September 4, 2017     Factual News  [northern, ireland, political, party, rapidly,...                 0.8590              positive

In [60]: 1 X = [','.join(map(str, l)) for l in data['text_clean']]


2 Y = data['fake_or_factual']

In [61]: 1 # text vectorization - CountVectorizer


2 countvec = CountVectorizer()
3 countvec_fit = countvec.fit_transform(X)
4 bag_of_words = pd.DataFrame(countvec_fit.toarray(), columns = countvec.get_feature_names_out())
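TfidfVectorizer is imported above but never used; a sketch of swapping it in for the bag-of-words features (same interface, tf-idf weights instead of raw counts):

tfidfvec = TfidfVectorizer()
tfidfvec_fit = tfidfvec.fit_transform(X)
tfidf_features = pd.DataFrame(tfidfvec_fit.toarray(), columns=tfidfvec.get_feature_names_out())
# tfidf_features could replace bag_of_words in the train/test split below to compare the two representations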

In [62]: 1 # split into train and test data


2 X_train, X_test, y_train, y_test = train_test_split(bag_of_words, Y, test_size=0.3)

In [63]: 1 lr = LogisticRegression(random_state=0).fit(X_train, y_train)


In [64]: 1 y_pred_lr = lr.predict(X_test)

In [65]: 1 accuracy_score(y_pred_lr, y_test)

Out[65]: 0.8833333333333333

In [66]: 1 print(classification_report(y_test, y_pred_lr))

precision recall f1-score support

Factual News 0.93 0.83 0.88 30


Fake News 0.85 0.93 0.89 30

accuracy 0.88 60
macro avg 0.89 0.88 0.88 60
weighted avg 0.89 0.88 0.88 60
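Beyond accuracy, a confusion matrix makes the error types explicit; a sketch using scikit-learn (confusion_matrix is an extra import, not part of the cell above):

from sklearn.metrics import confusion_matrix

print(confusion_matrix(y_test, y_pred_lr, labels=["Factual News", "Fake News"]))
# rows are true classes and columns are predicted classes, in the label order given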

In [67]: 1 svm = SGDClassifier().fit(X_train, y_train)

In [68]: 1 y_pred_svm = svm.predict(X_test)

In [69]: 1 accuracy_score(y_pred_svm, y_test)

Out[69]: 0.8666666666666667

In [70]: 1 print(classification_report(y_test, y_pred_svm))

precision recall f1-score support

Factual News 0.82 0.93 0.87 30


Fake News 0.92 0.80 0.86 30

accuracy 0.87 60
macro avg 0.87 0.87 0.87 60
weighted avg 0.87 0.87 0.87 60
