Sentiment Analysis Using NLP

The document discusses using natural language processing techniques like part-of-speech tagging and sentiment analysis to analyze fake and factual news articles. It imports data on fake and factual news articles, explores the data, and imports necessary NLP packages. The goal is to develop a system that can identify fake news by analyzing the text of news articles.



Sentiment Analysis Use Case with the Implementation of Natural Language Processing (NLP)

Assume you work for a social media company. The company is concerned about the growing amount of false information circulating on its platform, and you have been tasked with working out how to spot fake news and building a system to do so. Together, let's explore and clean the data before attempting to classify fabricated versus real news reports. We'll also discuss how we might present our results to stakeholders and produce some charts of our outputs.

Import Data
In [ ]: 1 import pandas as pd
2 import matplotlib.pyplot as plt

In [ ]: 1 # set plot options


2 plt.rcParams['figure.figsize'] = (12, 8)
3 default_plot_colour = "#00bfbf"

In [ ]: 1 from google.colab import drive


2 drive.mount('/content/drive')

Mounted at /content/drive

In [10]: 1 data = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/NLP/fake_news_data.csv")
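If you are running this outside Colab, the Drive mount isn't needed; a minimal sketch of loading the same file from a local path (the path below is a placeholder, not the original location) and checking what we have:

data = pd.read_csv("fake_news_data.csv")        # placeholder local path; expects columns title, text, date, fake_or_factual
print(data.shape)
print(data['fake_or_factual'].value_counts())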


In [11]: 1 data.head()

Out[11]:
                                                title                                                text                date  fake_or_factual
0  HOLLYWEIRD LIB SUSAN SARANDON Compares Muslim ...  There are two small problems with your analogy...        Dec 30, 2015        Fake News
1   Elijah Cummings Called Trump Out To His Face ...  Buried in Trump s bonkers interview with New Y...       April 6, 2017        Fake News
2    Hillary Clinton Says Half Her Cabinet Will Be...  Women make up over 50 percent of this country,...      April 26, 2016        Fake News
3  Russian bombing of U.S.-backed forces being di...  WASHINGTON (Reuters) - U.S. Defense Secretary ...  September 18, 2017     Factual News
4   Britain says window to restore Northern Irelan...  BELFAST (Reuters) - Northern Ireland s politic...   September 4, 2017     Factual News

In [12]: 1 data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 198 entries, 0 to 197
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 title 198 non-null object
1 text 198 non-null object
2 date 198 non-null object
3 fake_or_factual 198 non-null object
dtypes: object(4)
memory usage: 6.3+ KB


In [13]: 1 # plot number of fake and factual articles


2 data['fake_or_factual'].value_counts().plot(kind='bar', color=default_plot_colour)
3 plt.title('Count of Article Classification')
4 plt.ylabel('# of Articles')
5 plt.xlabel('Classification')

Out[13]: Text(0.5, 0, 'Classification')



Import packages required for processing and analysis


In [14]:  1 !pip install vaderSentiment
          2 import seaborn as sns
          3 import spacy
          4 from spacy import displacy
          5 from spacy import tokenizer
          6 import re
          7 import nltk
          8 from nltk.tokenize import word_tokenize
          9 from nltk.stem import PorterStemmer, WordNetLemmatizer
         10 from nltk.corpus import stopwords
         11 from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
         12 import gensim
         13 import gensim.corpora as corpora
         14 from gensim.models.coherencemodel import CoherenceModel
         15 from gensim.models import LsiModel, TfidfModel
         16 from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
         17 from sklearn.model_selection import train_test_split
         18 from sklearn.linear_model import LogisticRegression, SGDClassifier
         19 from sklearn.metrics import accuracy_score, classification_report

Requirement already satisfied: vaderSentiment in /usr/local/lib/python3.10/dist-packages (3.3.2)
Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from vaderSentiment) (2.31.0)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->vaderSentiment) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->vaderSentiment) (3.4)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->vaderSentiment) (2.0.7)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->vaderSentiment) (2023.7.22)

POS Tagging
In [15]: 1 nlp = spacy.load('en_core_web_sm')


In [16]: 1 # split data by fake and factual news


2 fake_news = data[data['fake_or_factual'] == "Fake News"]
3 fact_news = data[data['fake_or_factual'] == "Factual News"]

In [17]: 1 # create spaCy documents - use pipe for the dataframe


2 fake_spaceydocs = list(nlp.pipe(fake_news['text']))
3 fact_spaceydocs = list(nlp.pipe(fact_news['text']))

In [18]:  1 # create function to extract tags for each document in our data
          2 def extract_token_tags(doc:spacy.tokens.doc.Doc):
          3     return [(i.text, i.ent_type_, i.pos_) for i in doc]
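To see what extract_token_tags produces, here is a quick sketch on a single made-up sentence (the exact tags depend on the en_core_web_sm model version):

example_doc = nlp("Hillary Clinton visited Ohio on Tuesday, Reuters reported.")
extract_token_tags(example_doc)
# each tuple is (token text, entity label or '' if none, coarse POS tag), roughly:
# [('Hillary', 'PERSON', 'PROPN'), ('Clinton', 'PERSON', 'PROPN'), ('visited', '', 'VERB'), ('Ohio', 'GPE', 'PROPN'), ...]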

In [19]:  1 # tag fake dataset
          2 fake_tagsdf = []
          3 columns = ["token", "ner_tag", "pos_tag"]
          4
          5 for ix, doc in enumerate(fake_spaceydocs):
          6     tags = extract_token_tags(doc)
          7     tags = pd.DataFrame(tags)
          8     tags.columns = columns
          9     fake_tagsdf.append(tags)
         10
         11 fake_tagsdf = pd.concat(fake_tagsdf)
         12
         13 # tag factual dataset
         14 fact_tagsdf = []
         15
         16 for ix, doc in enumerate(fact_spaceydocs):
         17     tags = extract_token_tags(doc)
         18     tags = pd.DataFrame(tags)
         19     tags.columns = columns
         20     fact_tagsdf.append(tags)
         21
         22 fact_tagsdf = pd.concat(fact_tagsdf)
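The two loops above differ only in their input, so they could be collapsed into a single helper; a sketch (tag_documents is a new name, not part of the notebook):

def tag_documents(docs):
    # build one token/NER/POS dataframe from a list of spaCy docs
    frames = []
    for doc in docs:
        frames.append(pd.DataFrame(extract_token_tags(doc), columns=["token", "ner_tag", "pos_tag"]))
    return pd.concat(frames)

# equivalent to the two loops above:
# fake_tagsdf = tag_documents(fake_spaceydocs)
# fact_tagsdf = tag_documents(fact_spaceydocs)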


In [20]: 1 fake_tagsdf.head()

Out[20]:    token     ner_tag   pos_tag
         0  There               PRON
         1  are                 VERB
         2  two       CARDINAL  NUM
         3  small               ADJ
         4  problems            NOUN


In [21]: 1 # token frequency count (fake)


2 pos_counts_fake = fake_tagsdf.groupby(['token','pos_tag']).size().reset_index(name='counts').sort_values(by='counts', ascending=False)
3 pos_counts_fake.head(10)

Out[21]: token pos_tag counts

28 , PUNCT 1908

7446 the DET 1834

39 . PUNCT 1531

5759 of ADP 922

2661 and CCONJ 875

2446 a DET 804

0 SPACE 795

7523 to PART 767

4915 in ADP 667

5094 is AUX 419


In [22]: 1 # token frequency count (fact)


2 pos_counts_fact = fact_tagsdf.groupby(['token','pos_tag']).size().reset_index(name='counts').sort_values(by='counts', ascending=False)
3 pos_counts_fact.head(10)

Out[22]: token pos_tag counts

6169 the DET 1903

15 , PUNCT 1698

22 . PUNCT 1381

4733 of ADP 884

1905 a DET 789

2100 and CCONJ 757

4015 in ADP 672

6230 to PART 660

4761 on ADP 482

5586 said VERB 452


In [23]: 1 # frequencies of pos tags


2 pos_counts_fake.groupby(['pos_tag'])['token'].count().sort_values(ascending=False).head(10)

Out[23]: pos_tag
NOUN 2597
VERB 1814
PROPN 1657
ADJ 876
ADV 412
NUM 221
PRON 99
ADP 88
AUX 58
SCONJ 54
Name: token, dtype: int64

In [24]: 1 pos_counts_fact.groupby(['pos_tag'])['token'].count().sort_values(ascending=False).head(10)

Out[24]: pos_tag
NOUN 2182
VERB 1535
PROPN 1387
ADJ 753
ADV 271
NUM 203
PRON 81
ADP 70
AUX 44
SCONJ 39
Name: token, dtype: int64


In [25]: 1 # dive into differences in nouns


2 pos_counts_fake[pos_counts_fake.pos_tag == "NOUN"][0:15]

Out[25]: token pos_tag counts

5969 people NOUN 77

7959 women NOUN 55

6204 president NOUN 53

7511 time NOUN 52

8011 year NOUN 44

3134 campaign NOUN 44

4577 government NOUN 41

5208 law NOUN 40

7344 t NOUN 40

8013 years NOUN 40

7157 state NOUN 39

4010 election NOUN 37

5474 media NOUN 36

3639 day NOUN 35

3534 country NOUN 33


In [26]: 1 pos_counts_fact[pos_counts_fact.pos_tag == "NOUN"][0:15]

Out[26]: token pos_tag counts

3748 government NOUN 71

6639 year NOUN 64

5927 state NOUN 58

2373 bill NOUN 55

1982 administration NOUN 51

3289 election NOUN 48

5084 president NOUN 47

4804 order NOUN 45

4937 people NOUN 45

2509 campaign NOUN 42

4271 law NOUN 42

6118 tax NOUN 39

5415 reporters NOUN 38

5930 statement NOUN 37

4941 percent NOUN 36

Named Entities
In [27]: 1 # top entities in fake news
2 top_entities_fake = fake_tagsdf[fake_tagsdf['ner_tag'] != ""] \
3 .groupby(['token','ner_tag']).size().reset_index(name='counts') \
4 .sort_values(by='counts', ascending=False)


In [28]: 1 # top entities in fact news


2 top_entities_fact = fact_tagsdf[fact_tagsdf['ner_tag'] != ""] \
3 .groupby(['token','ner_tag']).size().reset_index(name='counts') \
4 .sort_values(by='counts', ascending=False)

In [29]: 1 # create custom palette to ensure plots are consistent


2 ner_palette = {
3 'ORG': sns.color_palette("Set2").as_hex()[0],
4 'GPE': sns.color_palette("Set2").as_hex()[1],
5 'NORP': sns.color_palette("Set2").as_hex()[2],
6 'PERSON': sns.color_palette("Set2").as_hex()[3],
7 'DATE': sns.color_palette("Set2").as_hex()[4],
8 'CARDINAL': sns.color_palette("Set2").as_hex()[5],
9 'PERCENT': sns.color_palette("Set2").as_hex()[6]
10 }
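Since every entry draws from the same Set2 palette, the mapping could also be built in one pass; a sketch that produces the same dictionary:

set2_hex = sns.color_palette("Set2").as_hex()
entity_labels = ['ORG', 'GPE', 'NORP', 'PERSON', 'DATE', 'CARDINAL', 'PERCENT']
ner_palette = dict(zip(entity_labels, set2_hex))   # same label-to-colour mapping as above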


In [30]: 1 sns.barplot(
2 x = 'counts',
3 y = 'token',
4 hue = 'ner_tag',
5 palette = ner_palette,
6 data = top_entities_fake[0:10],
7 orient = 'h',
8 dodge=False
9 ) \
10 .set(title='Most Common Entities in Fake News')

Out[30]: [Text(0.5, 1.0, 'Most Common Entities in Fake News')]


In [31]: 1 sns.barplot(
2 x = 'counts',
3 y = 'token',
4 hue = 'ner_tag',
5 palette = ner_palette,
6 data = top_entities_fact[0:10],
7 orient = 'h',
8 dodge=False
9 ) \
10 .set(title='Most Common Entities in Factual News')

Out[31]: [Text(0.5, 1.0, 'Most Common Entities in Factual News')]
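displacy was imported earlier but hasn't been used yet; as a sketch, it can highlight the entities in a single article inline (article 0 is just an arbitrary example):

# renders inline HTML in a notebook environment
displacy.render(fake_spaceydocs[0], style="ent", jupyter=True)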


Text Pre-processing
In [32]: 1 # a lot of the factual news has a location tag at the beginning of the article, let's use regex to remove it
2 data['text_clean'] = data.apply(lambda x: re.sub(r"^[^-]*-\s*", "", x['text']), axis=1)
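A quick check of what that regex does on a Reuters-style opener (the example string is made up):

sample = "WASHINGTON (Reuters) - U.S. Defense Secretary Jim Mattis said on Monday ..."
re.sub(r"^[^-]*-\s*", "", sample)
# -> 'U.S. Defense Secretary Jim Mattis said on Monday ...'
# everything up to and including the first hyphen (the location/agency tag) is stripped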


In [33]: 1 # lowercase
2 data['text_clean'] = data['text_clean'].str.lower()

In [34]: 1 # remove punctuation


2 data['text_clean'] = data.apply(lambda x: re.sub(r"([^\w\s])", "", x['text_clean']), axis=1)

In [36]: 1 # stop words


2 nltk.download('stopwords')
3
4 en_stopwords = stopwords.words('english')
5 print(en_stopwords) # check this against our most frequent n-grams

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll",
 "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's",
 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs',
 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am',
 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does',
 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while',
 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during',
 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over',
 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how',
 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not',
 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don',
 "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren',
 "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn',
 "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't",
 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren',
 "weren't", 'won', "won't", 'wouldn', "wouldn't"]

[nltk_data] Downloading package stopwords to /root/nltk_data...


[nltk_data] Unzipping corpora/stopwords.zip.

In [37]: 1 data['text_clean'] = data['text_clean'].apply(lambda x: ' '.join([word for word in x.split() if word not in en_stopwords]))

In [39]: 1 # tokenize
2 nltk.download('punkt')
3 data['text_clean'] = data.apply(lambda x: word_tokenize(x['text_clean']), axis=1)

[nltk_data] Downloading package punkt to /root/nltk_data...


[nltk_data] Unzipping tokenizers/punkt.zip.


In [41]: 1 # lemmatize
2 nltk.download('wordnet')
3 lemmatizer = WordNetLemmatizer()
4 data["text_clean"] = data["text_clean"].apply(lambda tokens: [lemmatizer.lemmatize(token) for token

[nltk_data] Downloading package wordnet to /root/nltk_data...
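For a sense of what the lemmatizer changes, a small sketch on a few tokens (WordNetLemmatizer defaults to noun lemmas, so verb forms such as "said" are left untouched):

[lemmatizer.lemmatize(w) for w in ["women", "parties", "reporters", "said"]]
# -> ['woman', 'party', 'reporter', 'said']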

In [42]: 1 data.head()

Out[42]:
                                                title                                                text                date  fake_or_factual                                          text_clean
0  HOLLYWEIRD LIB SUSAN SARANDON Compares Muslim ...  There are two small problems with your analogy...        Dec 30, 2015        Fake News  [yearold, oscarwinning, actress, described, me...
1   Elijah Cummings Called Trump Out To His Face ...  Buried in Trump s bonkers interview with New Y...       April 6, 2017        Fake News  [buried, trump, bonkers, interview, new, york,...
2    Hillary Clinton Says Half Her Cabinet Will Be...  Women make up over 50 percent of this country,...      April 26, 2016        Fake News  [woman, make, 50, percent, country, grossly, u...
3  Russian bombing of U.S.-backed forces being di...  WASHINGTON (Reuters) - U.S. Defense Secretary ...  September 18, 2017     Factual News  [u, defense, secretary, jim, mattis, said, mon...
4   Britain says window to restore Northern Irelan...  BELFAST (Reuters) - Northern Ireland s politic...   September 4, 2017     Factual News  [northern, ireland, political, party, rapidly,...

In [43]: 1 # most common unigrams after preprocessing


2 tokens_clean = sum(data['text_clean'], [])
3 unigrams = (pd.Series(nltk.ngrams(tokens_clean, 1)).value_counts())
4 print(unigrams[:10])

(said,) 560
(trump,) 520
(u,) 255
(state,) 250
(president,) 226
(would,) 210
(one,) 141
(year,) 128
(republican,) 128
(also,) 124
dtype: int64


In [44]: 1 sns.barplot(x = unigrams.values[:10],


2 y = unigrams.index[:10],
3 orient = 'h',
4 palette=[default_plot_colour])\
5 .set(title='Most Common Unigrams After Preprocessing')

Out[44]: [Text(0.5, 1.0, 'Most Common Unigrams After Preprocessing')]


In [45]: 1 # most common bigrams after preprocessing


2 bigrams = (pd.Series(nltk.ngrams(tokens_clean, 2)).value_counts())
3 print(bigrams[:10])

(donald, trump) 92
(united, state) 80
(white, house) 72
(president, donald) 42
(hillary, clinton) 31
(new, york) 31
(image, via) 29
(supreme, court) 29
(official, said) 26
(food, stamp) 24
dtype: int64

Sentiment Analysis
In [46]: 1 # use vader so we also get a neutral sentiment count
2 vader_sentiment = SentimentIntensityAnalyzer()

In [47]: 1 data['vader_sentiment_score'] = data['text'].apply(lambda review: vader_sentiment.polarity_scores(review)['compound'])
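polarity_scores returns negative, neutral, positive, and compound values; only the compound score (a normalised value between -1 and 1) is kept here. A quick sketch on a made-up sentence:

vader_sentiment.polarity_scores("The senator gave a fantastic, uplifting speech.")
# -> a dict with keys 'neg', 'neu', 'pos' and 'compound'; the exact numbers depend on the VADER lexicon version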

In [48]: 1 # create labels


2 bins = [-1, -0.1, 0.1, 1]
3 names = ['negative', 'neutral', 'positive']
4
5 data['vader_sentiment_label'] = pd.cut(data['vader_sentiment_score'], bins, labels=names)


In [49]: 1 data['vader_sentiment_label'].value_counts().plot.bar(color=default_plot_colour)

Out[49]: <Axes: >


In [50]: 1 sns.countplot(
2 x = 'fake_or_factual',
3 hue = 'vader_sentiment_label',
4 palette = sns.color_palette("hls"),
5 data = data
6 ) \
7 .set(title='Sentiment by News Type')

Out[50]: [Text(0.5, 1.0, 'Sentiment by News Type')]


LDA
In [51]: 1 # fake news data vectorization
2 fake_news_text = data[data['fake_or_factual'] == "Fake News"]['text_clean'].reset_index(drop=True)
3 dictionary_fake = corpora.Dictionary(fake_news_text)
4 doc_term_fake = [dictionary_fake.doc2bow(text) for text in fake_news_text]
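doc2bow maps each tokenised document to a list of (token id, count) pairs against the dictionary; a sketch on a toy document (the words are made up):

toy_dictionary = corpora.Dictionary([["trump", "said", "said", "election"]])
print(toy_dictionary.token2id)                                        # token -> integer id mapping
print(toy_dictionary.doc2bow(["trump", "said", "said", "election"]))  # e.g. the id for "said" appears with a count of 2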


In [52]:  1 # generate coherence scores to determine an optimum number of topics
          2 coherence_values = []
          3 model_list = []
          4
          5 min_topics = 2
          6 max_topics = 11
          7
          8 for num_topics_i in range(min_topics, max_topics+1):
          9     model = gensim.models.LdaModel(doc_term_fake, num_topics=num_topics_i, id2word=dictionary_fake)
         10     model_list.append(model)
         11     coherence_model = CoherenceModel(model=model, texts=fake_news_text, dictionary=dictionary_fake, coherence='c_v')
         12     coherence_values.append(coherence_model.get_coherence())
         13
         14 plt.plot(range(min_topics, max_topics+1), coherence_values)
         15 plt.xlabel("Number of Topics")
         16 plt.ylabel("Coherence score")
         17 plt.legend(["coherence_values"], loc='best')
         18 plt.show()

WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
(the same warning is emitted once for each model fitted in the loop)


In [53]: 1 # create lda model


2 num_topics_fake = 5
3
4 lda_model_fake = gensim.models.LdaModel(corpus=doc_term_fake,
5 id2word=dictionary_fake,
6 num_topics=num_topics_fake)
7
8 lda_model_fake.print_topics(num_topics=num_topics_fake, num_words=10)

WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy

Out[53]: [(0,
  '0.009*"trump" + 0.004*"food" + 0.004*"said" + 0.003*"u" + 0.003*"stamp" + 0.003*"state" + 0.003*"time" + 0.003*"million" + 0.003*"president" + 0.003*"woman"'),
 (1,
  '0.011*"trump" + 0.007*"said" + 0.005*"president" + 0.004*"clinton" + 0.004*"one" + 0.004*"time" + 0.003*"obama" + 0.003*"state" + 0.003*"would" + 0.003*"u"'),
 (2,
  '0.015*"trump" + 0.005*"would" + 0.005*"president" + 0.003*"clinton" + 0.003*"student" + 0.003*"u" + 0.003*"woman" + 0.003*"people" + 0.003*"one" + 0.003*"year"'),
 (3,
  '0.010*"trump" + 0.006*"said" + 0.006*"state" + 0.005*"republican" + 0.005*"president" + 0.004*"clinton" + 0.004*"time" + 0.004*"would" + 0.003*"woman" + 0.003*"people"'),
 (4,
  '0.008*"trump" + 0.005*"clinton" + 0.004*"state" + 0.004*"one" + 0.004*"u" + 0.004*"would" + 0.003*"said" + 0.003*"mccain" + 0.003*"people" + 0.003*"official"')]

In [54]: 1 # our topics contain a lot of very similar words, let's try using latent semantic analysis with tf-idf

TF-IDF & LSA


In [55]:  1 def tfidf_corpus(doc_term_matrix):
          2     # create a corpus using tf-idf vectorization
          3     tfidf = TfidfModel(corpus=doc_term_matrix, normalize=True)
          4     corpus_tfidf = tfidf[doc_term_matrix]
          5     return corpus_tfidf


In [56]:  1 def get_coherence_scores(corpus, dictionary, text, min_topics, max_topics):
          2     # generate coherence scores to determine an optimum number of topics
          3     coherence_values = []
          4     model_list = []
          5     for num_topics_i in range(min_topics, max_topics+1):
          6         model = LsiModel(corpus, num_topics=num_topics_i, id2word=dictionary)
          7         model_list.append(model)
          8         coherence_model = CoherenceModel(model=model, texts=text, dictionary=dictionary, coherence='c_v')
          9         coherence_values.append(coherence_model.get_coherence())
         10     # plot results
         11     plt.plot(range(min_topics, max_topics+1), coherence_values)
         12     plt.xlabel("Number of Topics")
         13     plt.ylabel("Coherence score")
         14     plt.legend(["coherence_values"], loc='best')
         15     plt.show()


In [57]: 1 # create tfidf representation


2 corpus_tfidf_fake = tfidf_corpus(doc_term_fake)
3 # coherence scores for fake news data
4 get_coherence_scores(corpus_tfidf_fake, dictionary_fake, fake_news_text, min_topics=2, max_topics=11)


In [58]: 1 # model for fake news data


2 lsa_fake = LsiModel(corpus_tfidf_fake, id2word=dictionary_fake, num_topics=3)
3 lsa_fake.print_topics()

Out[58]: [(0,
  '0.218*"trump" + 0.135*"clinton" + 0.094*"woman" + 0.087*"president" + 0.086*"republican" + 0.085*"obama" + 0.084*"party" + 0.083*"school" + 0.081*"said" + 0.079*"time"'),
 (1,
  '-0.299*"boiler" + -0.253*"room" + -0.250*"acr" + -0.186*"jay" + -0.185*"animal" + -0.176*"episode" + -0.147*"analysis" + -0.122*"dyer" + -0.119*"corner" + -0.119*"spore"'),
 (2,
  '-0.218*"school" + 0.194*"clinton" + 0.165*"conference" + -0.151*"county" + -0.136*"student" + 0.120*"press" + 0.116*"trump" + 0.112*"hillary" + -0.101*"love" + 0.096*"email"')]

Predict fake or factual news


In [59]: 1 data.head()

Out[59]:
                                                title                                                text                date  fake_or_factual                                          text_clean  vader_sentiment_score vader_sentiment_label
0  HOLLYWEIRD LIB SUSAN SARANDON Compares Muslim ...  There are two small problems with your analogy...        Dec 30, 2015        Fake News  [yearold, oscarwinning, actress, described, me...                -0.3660              negative
1   Elijah Cummings Called Trump Out To His Face ...  Buried in Trump s bonkers interview with New Y...       April 6, 2017        Fake News  [buried, trump, bonkers, interview, new, york,...                -0.8197              negative
2    Hillary Clinton Says Half Her Cabinet Will Be...  Women make up over 50 percent of this country,...      April 26, 2016        Fake News  [woman, make, 50, percent, country, grossly, u...                 0.9779              positive
3  Russian bombing of U.S.-backed forces being di...  WASHINGTON (Reuters) - U.S. Defense Secretary ...  September 18, 2017     Factual News  [u, defense, secretary, jim, mattis, said, mon...                -0.3400              negative
4   Britain says window to restore Northern Irelan...  BELFAST (Reuters) - Northern Ireland s politic...   September 4, 2017     Factual News  [northern, ireland, political, party, rapidly,...                 0.8590              positive

In [60]: 1 X = [','.join(map(str, l)) for l in data['text_clean']]


2 Y = data['fake_or_factual']

In [61]: 1 # text vectorization - CountVectorizer


2 countvec = CountVectorizer()
3 countvec_fit = countvec.fit_transform(X)
4 bag_of_words = pd.DataFrame(countvec_fit.toarray(), columns = countvec.get_feature_names_out())
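TfidfVectorizer is imported above but never used; a sketch of swapping it in for the bag-of-words features (same interface, tf-idf weights instead of raw counts):

tfidfvec = TfidfVectorizer()
tfidfvec_fit = tfidfvec.fit_transform(X)
tfidf_features = pd.DataFrame(tfidfvec_fit.toarray(), columns=tfidfvec.get_feature_names_out())
# tfidf_features could replace bag_of_words in the train/test split below to compare the two representations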

In [62]: 1 # split into train and test data


2 X_train, X_test, y_train, y_test = train_test_split(bag_of_words, Y, test_size=0.3)

In [63]: 1 lr = LogisticRegression(random_state=0).fit(X_train, y_train)


In [64]: 1 y_pred_lr = lr.predict(X_test)

In [65]: 1 accuracy_score(y_pred_lr, y_test)

Out[65]: 0.8833333333333333

In [66]: 1 print(classification_report(y_test, y_pred_lr))

precision recall f1-score support

Factual News 0.93 0.83 0.88 30


Fake News 0.85 0.93 0.89 30

accuracy 0.88 60
macro avg 0.89 0.88 0.88 60
weighted avg 0.89 0.88 0.88 60
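Beyond accuracy, a confusion matrix makes the error types explicit; a sketch using scikit-learn (confusion_matrix is an extra import, not part of the cell above):

from sklearn.metrics import confusion_matrix

print(confusion_matrix(y_test, y_pred_lr, labels=["Factual News", "Fake News"]))
# rows are true classes and columns are predicted classes, in the label order given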

In [67]: 1 svm = SGDClassifier().fit(X_train, y_train)

In [68]: 1 y_pred_svm = svm.predict(X_test)

In [69]: 1 accuracy_score(y_pred_svm, y_test)

Out[69]: 0.8666666666666667

In [70]: 1 print(classification_report(y_test, y_pred_svm))

precision recall f1-score support

Factual News 0.82 0.93 0.87 30


Fake News 0.92 0.80 0.86 30

accuracy 0.87 60
macro avg 0.87 0.87 0.87 60
weighted avg 0.87 0.87 0.87 60
