
Turkish Journal of Physiotherapy and Rehabilitation; 32(3)

ISSN 2651-4451 | e-ISSN 2651-446X

AN ENHANCED METHOD FOR DETECTING FAKE NEWS USING MACHINE LEARNING

P. Pratima Rani1, D. Chandra Mouli2, Palla Sravani3

1 Asst. Prof., Department of CSE, Vignan's Institute of Information Technology (A), Visakhapatnam, A.P., India.
2 Asst. Prof., Department of CSE, Raghu Engineering College, Visakhapatnam, A.P., India.
3 Asst. Prof., Department of CSE, Vignan's Institute of Engineering for Women, Visakhapatnam, A.P., India.

ABSTRACT

The problem of fake news has grown much faster in recent years, and social media has dramatically changed its reach and impact. On one hand, its low cost, convenient accessibility and rapid sharing of information attract more people to read news on it. On the other hand, it enables the wide spread of fake news, which is nothing but false information intended to deceive people. As a result, automating fake news detection has become essential in order to keep online and social media credible. AI and machine learning are the latest technologies applied to identify and get rid of fake news with the help of algorithms.

In this work, machine learning techniques are employed to detect the credibility of news based on the textual content and the responses given by users. A comparison is made to show that the latter is more reliable and effective for identifying all kinds of news. The approach used in this work relies on the posterior likelihood of tokens in the responses of the two classes. It uses frequency-based features to train algorithms such as Support Vector Machine, Passive Aggressive Classifier, Multinomial Naïve Bayes, Logistic Regression and Stochastic Gradient Descent Classifier. This work also highlights a wide range of issues identified recently in this area, which gives a clearer picture of how the problem can be automated.

An experiment was carried out in this work to match lists of fake-related words against the text of the responses, in order to find out whether response-based detection is a good measure for deciding credibility. The results are very promising and leave scope for further research in the area. The Linear SVM and Stochastic Gradient Descent classifiers with TF-IDF vectors achieved accuracy above 90% and ROC area under the curve above 95%, respectively. This work can be used as a significant building block for determining the veracity of fake news.

Key words: TF-IDF – Term Frequency Inverse Document Frequency, SVM – Support Vector Machine, SGD – Stochastic Gradient Descent, PAC – Passive Aggressive Classifier, TP – True Positive, FP – False Positive, TN – True Negative, FN – False Negative.

I. INTRODUCTION
With the advancement of technology, information is freely accessible to everyone. The Internet provides an enormous amount of data, but the credibility of that data depends upon many factors. An enormous amount of data is published online and through other media daily, but it is hard to tell whether a piece of information is true or false. Judging it requires deep study and analysis of the story, which includes checking the facts by assessing the supporting sources, finding the original source of the information or checking the credibility of the authors. Fabricated information may be a deliberate attempt to damage or favour the reputation of an organisation, entity or individual, or it may simply be motivated by financial or political gain []. "Fake News" is the term coined for this kind of fabricated information, which misleads people.

During the Indian election campaigns, we find many such fabricated posts, news articles and morphed pictures circulating on social media.

In recent years, a substantial amount of research has been conducted in this area with satisfactory results. With the success and growth of AI and machine learning, technology has relieved humans of much tedious effort. Fake news detection using these technologies can save society from unnecessary chaos and social unrest.

In this paper, we discuss how to build a classifier that is able to predict whether a user's claim is fake or real. It uses machine learning algorithms and natural language processing techniques. Machine learning is a subset of AI in the field of computing that often uses statistical techniques to give computers the ability to learn from data without being explicitly programmed [3]. Natural language processing is an area of computing and AI concerned with interactions between computers and human (natural) languages, especially how to program computers to process and analyse large amounts of natural language data [4].

One of the earlier works [5] was based on text classification of the article's body and headline. The drawback of this approach is that tokens with a higher posterior probability in one of the two classes are not necessarily important words of that class, because fake news can be well written using tokens that appear as important ones in the real class. Hence, a simpler approach is to apply the higher posterior probability to the responses given by users instead of to the article's body.

Social media is used for rapidly spreading false news these days. A famous quote attributed to Winston Churchill goes, "A lie gets halfway around the world before the truth has a chance to get its pants on." With a large number of active users on social media, rumours and fake stories spread like wildfire. Responses to such news can prove to be a clincher in deciding whether the news is "fake" or "real". Users provide evidence in the form of multimedia or web links to support or deny a claim. Classification based on this approach would be a significant step in this direction. To support this argument, I performed an experiment on the occurrence of fake-related words in the collection of responses; Section 6.2.2 discusses this experiment.

II. LITERATURE SURVEY:


Research on fake news detection is a recent phenomenon and is gaining importance every day because of its huge negative impact on social and civic engagement. In this section, I review a number of the published works in this area.

Impact of Fake News:


Wang et al. [] state in their journal that the plague of fake news creates not only a lack of trust in journalism but also turbulence in the political world. Fake news influences people's decisions regarding whom to vote for during elections. According to researchers at the Oxford Internet Institute, in the run-up to the 2016 US Presidential election, fake news was prevalent and spread rapidly with the help of social media bots [16]. A social bot is an account on social media that is programmed to produce content and interact with humans or other malicious bots [6]. Studies reveal that these bots largely influenced the online election discussions [1]. Fake news hinders serious media coverage and makes it harder for journalists to cover important news stories [7]. An analysis done by Buzzfeed revealed that the top 20 fake news stories about the 2016 US Presidential election received more attention on Facebook than the top 20 election stories from 19 major media outlets [8].

Deaths have been caused by fake news. People have been physically attacked over fabricated stories spread on social media. In Myanmar, Rohingya people were arrested, jailed, and in some cases even raped and killed because of fake news [9]. Such attempts appear to have created real-world fears and have affected civic engagement and community conversations.

Combating Fake News through Machine Learning:


Combating fake news is a difficult task. Establishing whether a news story is fake by checking the truth of every fact manually is no cakewalk, because the truth of the facts exists on a continuum and depends heavily on the nuances of human language, which are difficult to parse into true/false dichotomies [10]. Sloppy writing with grammatical mistakes may suggest that the article is not written by a journalist and may probably be false.


News published or broadcast by an unknown media house or newspaper can also be fake, but these factors give no assurance; therefore, the definitions and types of fake news must be properly understood and categorised.

III. DEFINITIONS AND TYPES:


Fake news is news that is intentionally and verifiably false and has the potential to mislead viewers/readers. There are two important dimensions to this definition: "intention" and "authenticity". First, fake news propagates misinformation that can be verified as false. Second, fake news is created with the dishonest intention of misleading the public. This definition is widely adopted in recent research [11; 12; 13; 14]. In general, fake news can be categorised into three groups. In the first group, "Actual Fake News", we can put those kinds of news which are false and made up by the author of the article. The second group, "Fake news that is actually satire", is created purely to amuse rather than mislead its audience; therefore, intentionally misleading and deceptive fake news is different from obvious satire or parody. The third group is "Poorly reported news that fits an agenda". This kind of news has some real content but is not entirely correct and is meant especially for some political propaganda.

Many researchers have streamlined the types of fake news to simplify their research. For instance, according to the definitions given by [1], there are a few kinds of news that cannot be called "fake":

(1) Satire news with proper context. (2) Misinformation that is created unintentionally. (3) Conspiracy theories that are difficult to place in true/false dichotomies. Kai Shu has presented two main aspects of the fake news detection problem: "characterization" and "detection".

Fake news foundations

People who tend to believe that their perception of reality is the only accurate view can believe fake news to be true. They think that those who disagree with them are biased and irrational [15]. Also, people who prefer to receive news that confirms their existing beliefs and views are mostly biased [16], while others are socially conscious and choose the safer side while consuming and sharing news, following the norms of the community, even if the news shared is fake. These psychological and social human behavioural patterns are the two main foundations of fake news in traditional media. Along with these two factors, malicious Twitter bots are a foundation of fake news on social media [1].

IV. RELATED WORK:


According to various studies conducted in this area, fake news detection methods comprise four basic types: knowledge based, style based, stance based and visual based. This section elucidates research on all these kinds of detection methods and a few other important studies that have received wider recognition. It also presents some of the important features used recently in various research papers to determine the credibility of news. Feature extraction is the crucial phase of machine learning.

1. Fake News Detection Methods:


Knowledge Based Detection: This aims to use external sources to fact-check the claims made in the news content. Two typical external sources are the open web and knowledge graphs. Open web sources are compared to the claims in terms of consistency and frequency [18], [19], whereas a knowledge graph is used to check whether the claims can be inferred from existing facts in the graph [20], [21], [22]. Many fact-checking websites (e.g. AltNews, Snopes, Smhoaxslayer, Boomlive) use domain experts to determine news veracity manually. Facebook has recently partnered with the Indian fact-checking agency Boomlive to identify false news circulating on its website [23]. A problem concerning this method is automated fact-checking, which involves classifying sentences into non-factual, unimportant factual and check-worthy factual statements [24], [27].

Style Based Detection: Style based detection focuses on the way the content has been presented to users. Fake news is usually not written by journalists, and accordingly the style of writing might differ [9]. Song Feng implemented deep syntax models using PCFG (Probabilistic Context Free Grammars) to transform sentences into rules such as lexicalized/un-lexicalized production rules and grandparent rules, which describe the syntactic structure used for deception detection. William Yang Wang implemented deep network models, Convolutional Neural Networks (CNN), to determine the veracity of news. Fake articles sometimes show extreme behaviour in favour of one party.


This style of writing is known as the hyper-partisan style [39]. Linguistic features can be applied to detect this kind of writing style. In a number of article headlines, there is only enough information to make readers curious enough to go to a particular website or video. This kind of eye-catching headline or web link is called a click-bait headline [1], which is a source of fake news.

Style based methods also cover methods which find tokens with higher posterior probability in the two classes, using word embedding features. Sohan Mone used the Naïve Bayes algorithm to obtain the tokens found to be most indicative for the classification and used them for deep learning and logistic regression. They combined the hypotheses obtained from Naïve Bayes, SVM and Logistic Regression and observed an average accuracy of 83% on their training set. Although writing styles can contribute largely to detecting fake news, they seem to be less efficient because fake news can be written in a style similar to that of real news [10].

Stance based detection: This method compares how a series of posts on social media or a group of reputable sources feels about the claim: Agree, Disagree, Neutral or Unrelated. In [10], the authors used lexical as well as similarity features fed through a Multi-Layer Perceptron (MLP) with one hidden layer to detect the stance of the articles. They hard-coded a reputation score feature (Table 2.1) of various sources based on nationwide research studies. Their model achieved 82% accuracy for pure stance detection on their dataset. Martin Potthast used a "wisdom of crowd" feature to improve news verification by discovering conflicting viewpoints in micro-blogs with the help of a topic model method, Latent Dirichlet Allocation (LDA). Their overall news veracity accuracy reached up to 84%.

Visual Based Detection on Social Media: Digitally altered images circulate everywhere on social media like wildfire. Photoshop can be used freely these days to modify images well enough to fool people into thinking they are seeing the real picture. The field of multimedia forensics has produced a substantial number of methods for tampering detection in videos [40] and pictures. There are also a few basic techniques on the web that let ordinary users identify photo-shopped images, e.g. Google's reverse image search, Get image metadata, etc. Andrew Ward extracted many visual and statistical features (shown in Table 2.1) that can be used for detecting the authenticity of multimedia.

Other related works: [3] implemented a document similarity analysis that calculates the Jaccard similarity, a widely used similarity measure, between a news item in the test set and every news item in the fake news training set 'F' and the real news training set 'R'. The results obtained were very promising. Raymond S. Nickerson exploited the diffusion patterns of information to detect hoaxes. Many research papers have used different linguistic and word embedding features; the most common ones are TF-IDF, word2vec, punctuation, n-grams and PCFG.
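To illustrate the document-similarity idea described above, the short Python sketch below scores a news item by its average Jaccard similarity against toy fake and real corpora. It is only a minimal illustration; the token-set representation, the toy data and the decision rule are assumptions, not the exact procedure of [3].

def jaccard(a, b):
    # Jaccard similarity between two token sets.
    return len(a & b) / len(a | b) if (a | b) else 0.0

def avg_similarity(test_tokens, corpus):
    # Average Jaccard similarity of one item against every item in a corpus.
    return sum(jaccard(test_tokens, doc) for doc in corpus) / len(corpus)

# Hypothetical toy corpora: each document is a set of lower-cased tokens.
fake_corpus = [{"shocking", "miracle", "cure"}, {"hoax", "photo", "morphed"}]
real_corpus = [{"reuters", "reports", "election"}, {"minister", "statement", "press"}]

test_item = {"morphed", "photo", "circulating"}
label = "Fake" if avg_similarity(test_item, fake_corpus) > avg_similarity(test_item, real_corpus) else "Real"
print(label)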

Figure 3.1 Flowchart of the method.


3.1 DATA COLLECTION AND ANALYSIS


We can obtain online news from different sources such as social media websites, search engines, the home pages of news agency websites or fact-checking websites. On the web, there are a few publicly available datasets for fake news classification, such as BuzzFeed News, LIAR, BS Detector and CREDBANK [1].

These datasets have been widely used in different research papers for determining the veracity of news. In the following sections, I briefly discuss the sources of the dataset used in this work.

Analysis of two publicly available datasets


In the first phase of this project, which was style based detection on the content/body of the news article, I used two different datasets of varying length and trained the model on each of them. The two datasets I used are described below.

LIAR: A Benchmark dataset for Fake news detection [32,34]

The original dataset contained 13 columns for the train, test and validation files. The training set included 12,386 human-labelled short statements sampled from news releases, TV or radio interviews, campaign speeches, etc. The data was collected from the fact-checking website PolitiFact through its API.

For the implementation of the first phase, I chose only the training set file and a couple of columns from this file for classification. The other columns can be added later to enhance performance.

Below are the columns that were used:


Column 1: Statement

Column 2: Label (True/False)

Classes (labels) were grouped so that the newly created dataset contains only two labels (True/False), compared to the six present in the original. They were grouped as below; a short code sketch of this grouping follows the list.

True -- True

Mostly-true -- True

Half-true -- True

Barely-true -- False

False -- False

Pants-fire -- False
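A minimal pandas sketch of this label grouping is shown below; the file name is a placeholder, and the column positions (1 for the label, 2 for the statement) follow the LIAR TSV layout as an assumption.

import pandas as pd

# Placeholder file name; the LIAR files are tab-separated with no header row.
df = pd.read_csv("train.tsv", sep="\t", header=None)
df = df[[1, 2]]                        # assumed positions: column 1 = label, column 2 = statement
df.columns = ["label", "statement"]

# Collapse the six original labels into the two-class scheme above.
label_map = {"true": "True", "mostly-true": "True", "half-true": "True",
             "barely-true": "False", "false": "False", "pants-fire": "False"}
df["label"] = df["label"].str.lower().map(label_map)
print(df["label"].value_counts())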

Another dataset obtained from Github [33]

The dataset contained four columns:

i). URL

ii). Headline

iii). Body

iv). Label

This dataset contained 4335 news articles with long body text, compared to the short texts in the previous dataset. The average word count of the body in this dataset was 576 words per article. The label was given as 0/1: 0 for fake news and 1 for real news. Classification models were trained on this dataset, their performance was compared and the best model was chosen. After analysing this dataset, I found that the fake news items were taken mainly from a few international fake news websites such as beforeitsnews.com, dailybuzzlive.com and activistpost.com, while the real news items were covered from a few leading outlets such as reuters.com.

Data Collection and Analysis for the Proposed Method


For the proposed method, which is based on the responses of users, I found that none of the publicly available datasets contained responses. I therefore assembled the required data from the social media websites Twitter and Facebook. There were two main steps in this data acquisition process:

1 Gathering the Fake and Real news

2 Extracting the Comments and other attributes

Gathering the Fake and Real News:

Fake news collection: I used Indian fact-checking websites for this purpose. AltNews.com, Smhoaxslayer.com and Boomlive.com are some of the agencies that are authentic and recognised for busting fake news [BBC]. I analysed the articles posted by them debunking fake news, looking only for the data relevant to the construction of the dataset. The relevant data were mainly tweets and Facebook posts by different users that had been busted as fake by the fact-checking agencies. All the URLs of the fake Twitter and Facebook posts were collected in the initial phase.

Real news collection: This was the easier task. I gathered posts/tweets of a few reputed news agencies, media journalists and even some verified users and groups. I picked news that carried strong sentiment (negative as well as positive) and sought high attention but was real.


Thus, the dataset created held a resemblance between fake and real news in terms of attracting attention. This was in fact a significant step for measuring the performance of the model, because responses to news with negative sentiment can make users believe that it is fake.

A total of 132 news items were collected for the dataset, of which 69 were classified as fake news and 63 as real news. I intentionally chose to keep the number of news items low but gathered a sizeable number of responses to each item, picking only those posts that received a considerable number of responses. The dataset consisted of five columns:

"user's claim", "post/tweet", "url", "comments" and "label".

Extracting the responses

For each of the post URLs collected, I extracted the comments for the respective post using the Python web scraping tools Selenium and Beautiful Soup. With Selenium, we can extract the fully rendered version of the page content. The Beautiful Soup library on its own cannot do this, because it only parses the raw page source it is given. Therefore, Selenium along with Beautiful Soup was used to scrape the required data.

I chose the first five to six pages of loaded comments, to keep the text neither too long nor too short. For convenience, the language of the collected responses was restricted to English. Facebook has a function called "Translate all" that converts all the comments to English in one go; on Twitter, any non-English comments have to be translated one by one. Thus, I scraped comments that were in English or sentences constructed using the English alphabet.
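A minimal sketch of this Selenium + Beautiful Soup combination is given below. The CSS selector, the scrolling strategy and the output format are assumptions for illustration only; real Facebook/Twitter pages need site-specific selectors and login handling.

import time
from bs4 import BeautifulSoup
from selenium import webdriver

def scrape_comments(url, scrolls=5, pause=2.0):
    # Load a post with Selenium, scroll to load more comments, then parse with Beautiful Soup.
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        for _ in range(scrolls):   # roughly "five to six pages" of comments
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(pause)
        soup = BeautifulSoup(driver.page_source, "html.parser")
        # Hypothetical selector; the real class names depend on the site's markup.
        return [node.get_text(strip=True) for node in soup.select("div.comment-text")]
    finally:
        driver.quit()

# comments = scrape_comments("https://example.com/some-post")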

V. STEPS OF METHOD IMPLEMENTATION


Text preparation
Social media data is largely unstructured; most of it is informal communication with typos, slang and bad grammar. To obtain better insights, it is necessary to clean the data before it can be used for predictive modelling. For this purpose, basic pre-processing was done on the news training data. This step comprised:

1 Conversion to lower case: The first step was to transform the text into lower case, simply to avoid multiple copies of the same word. For example, while computing word counts, "Response" and "response" would otherwise be counted as different words.

2 Removal of punctuation: Punctuation does not carry much significance when treating text data, so removing it helps reduce the overall size of the text.

3 Stop-word removal: Stop-words are the most commonly occurring words in a corpus, e.g. a, the, of, on, at. They typically define the structure of a text rather than its context. If treated as features, they would result in poor performance. Therefore, stop-words were removed from the training data as part of the text cleaning process.

4 Tokenization: This refers to dividing the text into a sequence of words or groups of words such as bigrams, trigrams, etc. Tokenization was done so that frequency-based vector values could be obtained for these tokens.

5 Lemmatization: This converts words into their root form. With the help of a vocabulary, it performs morphological analysis to pick up the root word. In this work, lemmatization was performed to improve the values of the frequency-based vectors.

Text pre-processing was an essential step before the data was ready for analysis. A noise-free corpus has a reduced feature sample space, thereby leading to increased accuracy. A short sketch of these steps is given below.
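The sketch below illustrates these pre-processing steps with NLTK; it is a minimal illustration, assuming the NLTK stop-word, tokenizer and WordNet resources can be downloaded, and is not the exact pipeline used in this work.

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time downloads of the resources used below (an assumption about the environment).
nltk.download("stopwords", quiet=True)
nltk.download("punkt", quiet=True)
nltk.download("wordnet", quiet=True)

STOP = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def preprocess(text):
    text = text.lower()                               # 1. conversion to lower case
    text = re.sub(r"[^\w\s]", " ", text)              # 2. removal of punctuation
    tokens = nltk.word_tokenize(text)                 # 3. tokenization
    tokens = [t for t in tokens if t not in STOP]     # 4. stop-word removal
    return [LEMMATIZER.lemmatize(t) for t in tokens]  # 5. lemmatization

print(preprocess("Fake images are spreading on social media!"))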

Feature generation


We can use text data to generate a variety of features such as word count, frequency of long words, frequency of unique words, n-grams, etc. By creating a representation of words that captures their meanings, semantic relationships and the various contexts they are used in, we can enable computers to understand text and perform clustering, classification, etc. For this purpose, word embedding techniques are used to convert text into numbers or vectors so that a computer can process them.

Word Embedding: A word-embedding format generally tries to map a word to a vector using a dictionary. The following frequency based word embedding vectors were used for training the data; they are also categorised as linguistic features.

Count Vector as a feature


A count vector is a matrix notation of the dataset, in which rows represent the documents in the corpus, columns represent terms from the corpus, and cells contain the count of a specific term in a particular document. The dictionary is created using the list of unique tokens or words in the corpus.

Example: Let us consider three documents in a corpus C, i.e. D1, D2 and D3, containing the text below:

D1: It was raining heavily yesterday.

D2: Bad weather caused heavy rainfall in London.

D3: Yesterday, London newspapers warned of heavy rainfall.

The dictionary can be created with the unique words. The unique words identified are: [Rain, Heavy, Yesterday, Bad, Weather, London, Newspapers, Warned]

Number of documents D = 3; number of unique words N = 8.

The count matrix represents the occurrence of each term in every document, so the count matrix M is of size 3 × 8.

A column of the matrix M can be called the word vector for the corresponding word. The word vector for "Yesterday" is [1, 0, 1]. The count vectorizer lists the words or tokens from the highest frequency in the corpus to the lowest; for example, Rain and Heavy have the highest occurrence in the corpus, so they lead the glossary in the dictionary. This feature was used in the proposed method to give the machine learning models a notion of which words social media users often use when they see a fake or real news item.
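A minimal sketch of this count-vector construction with scikit-learn is shown below, using the three toy documents above. Note that without stemming or lemmatization the learned vocabulary will differ slightly from the eight words listed above (e.g. "raining" and "rainfall" stay separate).

from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "It was raining heavily yesterday.",
    "Bad weather caused heavy rainfall in London.",
    "Yesterday, London newspapers warned of heavy rainfall.",
]

# Build the dictionary of unique tokens and the document-term count matrix.
vectorizer = CountVectorizer(stop_words="english")
M = vectorizer.fit_transform(docs)          # sparse matrix, shape (3, n_terms)

print(vectorizer.get_feature_names_out())   # the learned dictionary
print(M.toarray())                          # count of each term per document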

TF-IDF vectors as a feature:

The TF-IDF weight represents the relative importance of a term within a document and across the full corpus.

TF stands for Term Frequency: it measures how frequently a term appears in a document. Since document sizes vary, a term may appear more often in a long document than in a short one; thus, the term frequency is often divided by the length of the document.

IDF stands for Inverse Document Frequency: a word is not of much use if it is present in all the documents. Certain terms such as "a", "an", "the", "on", "of", etc. appear repeatedly in documents but are of little importance. IDF weighs down the importance of these terms and increases the importance of rare ones. The higher the IDF value, the more unique the word.

TF-IDF (Term Frequency-Inverse Document Frequency): TF-IDF works by penalising the most commonly occurring words, assigning them less weight, while giving higher weight to terms that are present in only a proper subset of the corpus and have a high occurrence in a particular document. It is the product of term frequency and inverse document frequency.


TF-IDF is a widely used feature for text classification. In addition, TF-IDF vectors can be calculated at different levels, i.e. word level and N-gram level, both of which I have used in this project.

i). Word level TF-IDF: calculates a score for every single term across the documents.

ii). N-gram level TF-IDF: calculates a score for combinations of N terms together across the documents (a short sketch of both levels follows below).
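A minimal sketch of word level and N-gram level TF-IDF vectors with scikit-learn follows; the response texts and parameter values are placeholders, not the exact data used in this work.

from sklearn.feature_extraction.text import TfidfVectorizer

responses = [
    "this is a fake video, clearly photoshopped",
    "altnews already debunked this hoax",
    "great reporting, thanks for sharing",
]

# Word level: every single token is a feature.
word_tfidf = TfidfVectorizer(analyzer="word", ngram_range=(1, 1))
X_word = word_tfidf.fit_transform(responses)

# N-gram level: unigrams up to trigrams become features (the range 1-3 used in this work).
ngram_tfidf = TfidfVectorizer(analyzer="word", ngram_range=(1, 3))
X_ngram = ngram_tfidf.fit_transform(responses)

print(X_word.shape, X_ngram.shape)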
Algorithms used for classification.

This section deals with training the classifier. Different classifiers were investigated to predict the category of the text. I explored five different machine learning algorithms: Multinomial Naïve Bayes, Passive Aggressive Classifier, Logistic Regression, Linear Support Vector Machine and Stochastic Gradient Descent. The implementations of these classifiers were done using the Python library scikit-learn.

Brief introduction to the algorithms


Naïve Bayes: This classification technique is based on Bayes' theorem, which assumes that the presence of a particular feature in a class is independent of the presence of any other feature. It provides a way of calculating the posterior probability.

Passive Aggressive Classifier: The Passive Aggressive algorithm is an online algorithm, ideal for classifying massive streams of data (e.g. Twitter). It is easy to implement and very fast. It works by taking an example, learning from it and then discarding it.

Logistic Regression: Logistic regression is a classification algorithm used to predict the probability of occurrence of an event (0/1, True/False, Yes/No). It uses the sigmoid function to estimate probabilities.

Support Vector Machine: In this algorithm, each data item is plotted as a point in n-dimensional space (n is the number of features), where the value of each feature is the value of the corresponding coordinate. It then finds the best possible hyper-plane, or set of hyper-planes, in this high dimensional space that separates the two classes. A linear kernel was used for SVM in this work.

Stochastic Gradient Descent: An SGD algorithm starts at a random point and updates the cost function at each iteration using one data point at a time, building a classifier with progressively higher accuracy given a large dataset. In SGD, a sample of the training set or a single training example is used to calculate the parameters, which is much faster than other gradient descent variants.
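As an illustration of how these five classifiers can be trained and compared on TF-IDF features with scikit-learn, a minimal sketch follows; the texts, labels and hyper-parameters are placeholders, not the exact settings used in this work.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, PassiveAggressiveClassifier, SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["fake video, already debunked by altnews", "official statement reported by reuters",
         "photoshopped image, please stop spreading", "press conference covered live today"]
labels = ["Fake", "Real", "Fake", "Real"]

models = {
    "MultinomialNB": MultinomialNB(),
    "PassiveAggressive": PassiveAggressiveClassifier(max_iter=1000),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "LinearSVM": LinearSVC(),
    "SGD": SGDClassifier(max_iter=1000),
}

for name, clf in models.items():
    pipe = make_pipeline(TfidfVectorizer(max_df=0.7), clf)
    pipe.fit(texts, labels)                     # train on the response texts
    print(name, pipe.score(texts, labels))      # training accuracy only, for illustration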

Metrics used to assess the Performance of the Model


In this section, I explore some of the most significant metrics by which a machine learning model's performance is measured. These metrics measure how well our model is able to classify or evaluate predictions. The metrics introduced below were used in this project.

Classification Accuracy: It is the most common evaluation metric for classification problems. It is defined as the number of correct predictions divided by the total number of predictions. However, this metric alone cannot give enough information to decide whether the model is a good one; it is suitable when there are equal numbers of observations in every class.

Area under the ROC curve: The area under the ROC curve is a performance metric used for binary classification. It tells a model's ability to discriminate between the two classes. If the area under the curve (AUC) is 1.0, the model has made all predictions correctly, whereas an AUC of 0.5 is only as good as random predictions. ROC can be broken down further into sensitivity and specificity; a binary classification problem is a trade-off between these two factors.

Sensitivity: Also called "Recall", it is defined as the number of instances from the positive class that are actually predicted correctly. This is known as the True Positive Rate. In this work, "Fake" was selected as the positive class and "Real" as the negative class.

Specificity: It is the number of instances in the negative class that are actually predicted correctly. It is also called the True Negative Rate.


Confusion Matrix: Also known as the error matrix, it is a table representation that shows the performance of the model. It is a special kind of contingency table with two dimensions: "actual", labelled on the x-axis, and "predicted" on the y-axis. The cells of the table contain the number of predictions made by the algorithm.

True Positives: Positive instances that are correctly predicted as positive.

True Negatives: Negative instances that are correctly predicted as negative.

False Positives: Negative instances that are incorrectly predicted as positive.

False Negatives: Positive instances that are incorrectly predicted as negative.

Classification Report: Scikit-learn provides a convenient report for classification problems which outputs precision, recall, F1 score and support for every class.

Precision: Precision is the ratio of correctly predicted positive instances to the total predicted positive instances. High precision means a low false positive rate.

Recall (Sensitivity): Recall is the ratio of correctly predicted positive instances to all instances actually belonging to the positive class.

F1-Score: It is the weighted average of precision and recall; therefore, it takes into account both false positives and false negatives. The F1 score is usually more useful than accuracy, especially when there is an uneven class distribution. Accuracy performs best when false positives and false negatives have a similar cost; if their costs differ widely, it is better to look at both precision and recall.
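A minimal sketch of computing these metrics with scikit-learn is given below; the true and predicted labels are placeholders.

from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score)

# Hypothetical ground truth and predictions ("Fake" is the positive class).
y_true = ["Fake", "Fake", "Real", "Real", "Fake", "Real"]
y_pred = ["Fake", "Real", "Real", "Real", "Fake", "Fake"]
y_score = [0.9, 0.4, 0.2, 0.1, 0.8, 0.6]   # model scores for the "Fake" class

print(accuracy_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred, labels=["Fake", "Real"]))
print(classification_report(y_true, y_pred))
print(roc_auc_score([1 if y == "Fake" else 0 for y in y_true], y_score))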

VI. EXPERIMENT, RESULTS & ANALYSIS


Experiments were performed using the above algorithms with the vector features, count vectors and TF-IDF vectors, at word level and N-gram level. Accuracy was noted for all models. I used the K-fold cross-validation technique to improve the reliability of the evaluation. In the first phase of the experiment, I applied text classification to the article bodies in the two publicly available datasets [][]. In the second phase, the experiment was performed on the responses collected for a set of fake news and real news claims extracted from Twitter and Facebook.

Dataset split using K-fold cross validation

This cross-validation technique was used to split the dataset randomly into k folds. (k-1) folds were used for building the model, while the k-th fold was used to check the effectiveness of the model. This was repeated until each of the k folds had served as the test set. I used 3-fold cross-validation for this experiment, where 67% of the data is used for training the model and the remaining 33% for testing.
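A minimal sketch of this 3-fold split with scikit-learn is shown below; texts and labels stand for the response corpus and its Fake/Real labels and are placeholders.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import LinearSVC

texts = np.array(["fake video again", "reuters confirms report",
                  "this is photoshopped", "official press release",
                  "hoax, do not spread", "verified by the ministry"])
labels = np.array(["Fake", "Real", "Fake", "Real", "Fake", "Real"])

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
for i, (train_idx, test_idx) in enumerate(skf.split(texts, labels), start=1):
    vec = TfidfVectorizer()
    X_train = vec.fit_transform(texts[train_idx])       # fit the vectorizer only on the training fold
    X_test = vec.transform(texts[test_idx])
    clf = LinearSVC().fit(X_train, labels[train_idx])
    print(f"Split {i}: accuracy = {clf.score(X_test, labels[test_idx]):.2f}")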

Set of Experiments Conducted

Experiment (Proposed method)

Responses were classified using count vectors and TF-IDF vectors at two levels. Word level: a single word was chosen as the token for this experiment.

N-gram level: the range of N-grams was kept from 1 to 3, i.e. from one word to at most three words (bigrams, trigrams), which were considered as tokens, and the experiment was performed.

Maximum document frequency was also used in this experiment as a parameter of the TF-IDF vectorizer. This parameter removes all tokens that appear in more than X% of the responses. Initially no maximum document frequency was set, but later X was increased in steps of 0.1 (i.e. 10%) and the results were noted down.
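A minimal sketch of sweeping this maximum-document-frequency parameter (scikit-learn's max_df) is shown below; the texts and labels are placeholders for the response corpus, and the sweep starts at 0.3 only because, with such a tiny placeholder corpus, smaller values would prune every term.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Placeholder corpus; in this work the responses collected from Twitter/Facebook were used.
texts = ["fake news again", "confirmed by reuters", "photoshopped image",
         "official statement", "this hoax is spreading", "verified report"]
labels = ["Fake", "Real", "Fake", "Real", "Fake", "Real"]

for x in [round(0.1 * i, 1) for i in range(3, 11)]:     # X = 0.3 ... 1.0
    pipe = make_pipeline(TfidfVectorizer(max_df=x), LinearSVC())
    scores = cross_val_score(pipe, texts, labels, cv=3)  # 3-fold, as in this work
    print(f"max_df = {x:.1f}  mean accuracy = {scores.mean():.3f}")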

Classification Accuracy at Word Level


Classification accuracy at word level performed better than at N-gram level, as can be seen from the tables above. The accuracy for Multinomial Naïve Bayes with TF-IDF at N-gram level was the lowest at 77.3%, while Linear SVM, Stochastic Gradient Descent and the Passive Aggressive Classifier using TF-IDF vectors performed well at both levels, with accuracy above 90%. Since classification accuracy alone is not sufficient to determine the effectiveness of the model, other metrics were also explored, especially for these three algorithms at word level using TF-IDF vectors. In another experiment, I included the MDM parameter (maximum document frequency) described above. As MDM was increased from 0 to 1 in steps of 0.1, the classification accuracy of the three models increased significantly, as shown in the table below.

Classification Accuracy using MDM (X = 0 to 1) in steps of 0.1

The best performing model was Linear SVM with 93.2% at MDM (X = 0.7), and close to it were Stochastic Gradient Descent and the Passive Aggressive Classifier with 92.4%. Beyond 0.7 the algorithms did not show further improvement, so MDM = 0.7 was chosen as the optimal value.

I then obtained the classification reports, including precision, recall and F1-score, of all three models at MDM (X = 0.7).

Classification Error: It indicates how often, overall, the model is incorrect; it is also called the misclassification rate.

Classification Error for Linear SVM-TFIDF = 100– 93.2 = 6.8%

The precision value for Linear SVM-TFIDF, at 94%, is higher than for SGD-TFIDF, which is 93%, and the recall (sensitivity) was calculated as 92% for both models.

Confusion Matrix Linear SVM-TFIDF

TP = 16, TN =24, FP = 1, FN = 3

Figure 5.1 Confusion Matrix for Linear SVM-TFIDF, Split 1

TP = 24, TN =18, FP = 1, FN = 1


Figure 5.2 Confusion Matrix for Linear SVM-TFIDF, Split 2

TP = 24, TN =17, FP = 2, FN = 1

Figure 5.3 Confusion Matrix for Linear SVM-TFIDF, Split 3

SGD-TFIDF

TP = 16, TN =24, FP = 1, FN = 3


Figure 5.4 Confusion Matrix for SGD-TFIDF, Split 1

TP = 24, TN =16, FP = 3, FN = 1

Figure 5.5 Confusion Matrix for SGD-TFIDF, Split 2

Sensitivity tells how sensitive the classifier is at detecting fake news, while specificity tells how selective or specific the model is at predicting real news. The choice of metric depends on what kind of application is being developed. The positive class in this binary classification is the class "Fake"; therefore, sensitivity should be higher, because false positives are more acceptable than false negatives in classification problems for such applications. The sensitivity is high for both models and has an equal value. By optimising further for sensitivity, we can obtain better results.

By decreasing the threshold for predicting fake news, we can increase the sensitivity of the classifier. This would increase the number of true positives. In this work, the threshold is set to 0.5 by default, but we can adjust it to increase sensitivity or specificity depending on what we want.

1. ROC Curve (Receiver Operating Characteristics Curve)


It is a way to see how various thresholds affect sensitivity and specificity, without actually changing the threshold.
Linear SVM (for three splits):



Figure 6.1 ROC Curve for Linear SVM-TFIDF, Split 1

Figure 6.2 ROC Curve for Linear SVM-TFIDF, Split 2

Figure 6.3 ROC Curve for Linear SVM-TFIDF, Split 3

SGD-TFIDF (for three splits)


Figure 6.4 ROC Curve for SGD-TFIDF, Split 2

Figure 6.5 ROC Curve for SGD-TFIDF, Split 3

ROC Area under Curve Score

The ROC AUC is the percentage of the ROC plot that lies underneath the curve. The higher the value of the AUC, the better the classifier. AUC is very useful when there is a high class imbalance. The AUC scores for both models are shown below:

STOCHASTIC GRADIENT DESCENT-TFIDF_ROC-AUC-SCORE: 95.7%

LinearSVM-TFIDF_ROC-AUC-SCORE: 96.0%

Cross Validation Score for both models:

LinearSVM-TFIDF_CROSS-VAL-SCORE: 97%

STOCHASTIC GRADIENT DESCENT-TFIDF_CROSS-VAL-SCORE: 96.9%
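A minimal sketch of how such ROC curves and AUC scores can be produced from a linear model's decision scores is shown below; the data and model settings are placeholders, not the actual experiment.

import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import RocCurveDisplay, roc_auc_score
from sklearn.svm import LinearSVC

# Placeholder data: 1 = Fake (positive class), 0 = Real.
train_texts = ["fake video", "photoshopped hoax", "reuters report", "official release"]
train_y = [1, 1, 0, 0]
test_texts = ["clearly fake and photoshopped", "report confirmed officially"]
test_y = [1, 0]

vec = TfidfVectorizer()
clf = LinearSVC().fit(vec.fit_transform(train_texts), train_y)

scores = clf.decision_function(vec.transform(test_texts))   # signed distance to the hyper-plane
print("ROC AUC:", roc_auc_score(test_y, scores))

RocCurveDisplay.from_predictions(test_y, scores)             # TPR vs FPR over all thresholds
plt.show()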

From the above experiments and results, it was concluded that the Linear SVM algorithm, using the Term Frequency-Inverse Document Frequency vector (word level) with a maximum document frequency of 0.7, gave the best performance. It was therefore chosen as the best model to determine the veracity of the news.

Other Experiments:

Content (the body of the article) was classified using count vectors and TF-IDF vectors on datasets with varying lengths of text content.


The experiment was performed on the two publicly available datasets. The first dataset contained more news items but short texts, while the second dataset contained fewer news items with long texts. Accuracy was noted for both of them. It was found that the accuracy achieved by the models on the second dataset was higher than on the first one. An experiment to count the number of fake-related words, or combinations of two or more such words, in the responses was also performed.

This was the most useful experiment, as it showed that response based detection has a significant advantage over text based detection on the article's body. In this experiment, I calculated the frequency of words signifying fakeness, e.g. "Fake news", "Misinformation", "Hoax", "Photo-shopped", etc., in the collected responses. The general idea is that if many such words are used in the responses, then the news has a high probability of being fake; if no such words are present, then the article is most likely a real one.
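A minimal sketch of this counting experiment follows; the list of fake-related markers is a small illustrative subset, not the full list used in this work.

import re

# Illustrative subset of fake-related markers; the full list used in this work is larger.
FAKE_MARKERS = ["fake", "misinformation", "hoax", "photoshopped", "morphed"]

def count_fake_markers(responses):
    # Count occurrences of fake-related words across all responses to one news item.
    text = " ".join(responses).lower()
    return sum(len(re.findall(r"\b" + re.escape(m) + r"\b", text)) for m in FAKE_MARKERS)

responses = ["This is a fake video, clearly morphed.",
             "Hoax! AltNews already debunked this fake claim."]
print(count_fake_markers(responses))   # a higher count suggests the claim is more likely fake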

An experiment to find the most informative tokens was also performed.

This was an extremely useful experiment, performed at the end. It finds the most informative features/tokens in the collection of responses that affect the news veracity (fake/real).
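With a linear model over TF-IDF features, the most informative tokens can be read off from the learned coefficients. A minimal sketch with placeholder data is given below; the sign convention (negative weights pointing to the "Fake" class when labels are sorted alphabetically) is an assumption of this sketch, not a statement about the original experiment.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

texts = ["fake morphed video", "altnews debunked this hoax",
         "reuters confirms the statement", "official press release today"]
labels = ["Fake", "Fake", "Real", "Real"]

vec = TfidfVectorizer()
X = vec.fit_transform(texts)
clf = LinearSVC().fit(X, labels)

terms = vec.get_feature_names_out()
coefs = clf.coef_.ravel()                      # one weight per token
order = np.argsort(coefs)
print("Most indicative of 'Fake':", [terms[i] for i in order[:5]])
print("Most indicative of 'Real':", [terms[i] for i in order[-5:]])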

Most informative tokens for SVC-TFIDF: The images below show the top 30 tokens for the three splits of the dataset, sorted by their TF-IDF coefficient values.

Figure 6.6: Top 30 informative tokens for SVC-TFIDF Split 1

Figure 6.7: Top 30 informative tokens for SVC-TFIDF Split2


Figure 6.8: Top 30 informative tokens for SVC-TFIDF Split 3

As is clear from the above images, the word "fake" has the highest negative TF-IDF coefficient of -1.85 in the Fake class; it is the most typical response from users in the case of fake claims. Some other important words are:

1 Video: Users generally comment "Fake video" or "Morphed video", etc.

2 Tweet: Words like "False tweet" or "Misleading tweet" are used in response to fake tweets.

3 Altnews: This word has a TF-IDF value of -0.403 and has a high influence in determining fake news. Altnews.com is a fact-checking agency which busts fake news circulating on social media. Users in their responses refer to articles of such agencies debunking the fake claims; therefore, this word appeared within the top 30 important tokens. Other popular fact-checking agencies are Smhoaxslayer.com, Snopes.com, Boomlive.com, Politifact.com, etc.

4 Check: This word is used in responses when people ask the tweeter to fact-check before tweeting, or in sentences like "Please check the facts before posting it.", etc.

5 Theonion: It has a TF-IDF value of 0.376. It is a popular satirical news website.

6 Photo shopped, Photoshop: For any morphed or modified images circulating on social media, users term them photo-shopped images in their responses; therefore, these words have a high influence.

7 Images: This word is used with phrases like "Fake images" or "Photo-shopped images", etc.

8 Spread/Spreading: Sentences like "Please don't spread misinformation." or "Why are you spreading this fake article?" appear mostly in comments.


VII. CONCLUSION
Users' opinions on social media posts can be applied effectively to determine the veracity of news. The dissemination of fake news on social media is extremely fast, and thus this method can serve as a basic building block for fake news detection. With a highest classification accuracy of 93.2%, sensitivity of 92%, ROC AUC score of 96% and cross-validation score of 97%, the Linear Support Vector Machine with TF-IDF vectors served as a better model compared to the others. In this work, the classification was performed on a small number of news items; adding more data to the dataset will test the consistency of the performance, thereby increasing users' trust in the system. In addition, gathering real news that closely resembles fake news will improve the training of the model. More linguistic features can also be applied to the responses to determine news veracity. Social media plays a crucial role in the news verification process; however, if the news is recent and is initially published in only a few news outlets, then social media cannot be used as an additional resource. The shift from traditional media to social media and the fast dissemination of news offsets this limitation. Therefore, by exploring more social media features in our experiments and combining them, we can create an efficient and reliable system for detecting fake news.

REFERENCES
1 Kai Shu, Amy Sliva, Suhang Wang, Jiliang Tang, Huan Liu. "Fake News Detection on Social Media", ACM SIGKDD Explorations Newsletter, 2017
2 https://en.wikipedia.org/wiki/Fake_news
3 https://en.wikipedia.org/wiki/Machine_learning
4 https://en.wikipedia.org/wiki/Natural_language_processing
5 FAKE NEWS IDENTIFICATION CS 229: MACHINE LEARNING : GROUP 621 Sohan Mone, Devyani Choudhary, Ayush Singhania
6 Emilio Ferrara, Onur Varol, Clayton Davis, Filippo Menczer, and Alessandro Flammini. The rise of social bots. Communications of the ACM, 59(7):96–104, 2016.
7 Carlos Merlo (2017), "Millonario negocio FAKE NEWS", Univision Noticias
8 Chang, Juju; Lefferman, Jake; Pedersen, Claire; Martz, Geoff (November 29, 2016). "When Fake News Stories Make Real News Headlines".
Nightline. ABCNews.
9 https://www.cjr.org/analysis/facebook-rohingya-myanmar-fake-news.php
10 https://blog.paperspace.com/fake-news-detection/
11 Eni Mustafaraj and Panagiotis Takis Metaxas. The fake news spreading plague: Was it preventable? arXiv preprint arXiv:1703.06988, 2017.
12 Niall J Conroy, Victoria L Rubin, and Yimin Chen. Automatic deception detection: Methods for finding fake news. Proceedings of the Association for
Information Science and Technology.
13 Martin Potthast, Johannes Kiesel, Kevin Reinartz, Janek Bevendorff, and Benno Stein. A stylometric inquiry into hyperpartisan and fake news. arXiv preprint, arXiv:1702.05638, 2017.
14 David O Klein and Joshua R Wueller. Fake news: A legal perspective. 2017.
15 Andrew Ward, L Ross, E Reed, E Turiel, and T Brown. Naive realism in everyday life: Implications for social conflict and misunderstanding. Values and knowledge, pages 103–135, 1997.
16 Raymond S. Nickerson. Confirmation bias: A ubiquitous phenomenon in many guises. Review of General Psychology, 2(2):175, 1998.
17 Alessandro Bessi and Emilio Ferrara. Social bots distort the 2016 us presidential election online discussion. First Monday, 21(11), 2016
18 Michele Banko, Michael J Cafarella, Stephen Soderland, Matthew Broadhead, and Oren Etzioni. Open information extraction from the web. In IJCAI'07.
19 Amr Magdy and Nayer Wanas. Web-based statistical fact checking of textual documents. In Proceedings of the 2nd international workshop on Search and mining user-generated contents, pages 103–110. ACM, 2010.
20 Giovanni Luca Ciampaglia, Prashant Shiralkar, Luis M Rocha, Johan Bollen, Filippo Menczer, and Alessandro Flammini. Computational fact checking from knowledge networks. PloS one, 10(6):e0128193, 2015.
21 You Wu, Pankaj K Agarwal, Chengkai Li, Jun Yang, and Cong Yu. Toward computational fact-checking. Proceedings of the VLDB Endowment, 7(7):589–600, 2014.
[22] Baoxu Shi and Tim Weninger. Fact checking in heterogeneous information networks. In WWW'16.
22 https://www.huffingtonpost.in/2018/04/25/facebook-says-its-fact-checkers-will-stop-fake-news-in-the-karnataka-election-well-just-have-to-believe-them_a_23420278/
23 Christina Boididou, Symeon Papadopoulos, Markos Zampoglou, Lazaros Apostolidis, Olga Papadopoulou, Yiannis Kompatsiaris. "Detection and
visualization of misleading content on Twitter", International Journal of Multimedia Information Retrieval, 2017
24 Cody Buntain, Jennifer Golbeck. "Automatically Identifying Fake News in Popular Twitter Threads", 2017 IEEE International Conference on Smart
Cloud (SmartCloud), 2017
25 Zhiwei Jin, Juan Cao, Yongdong Zhang, Jianshe Zhou, Qi Tian. "Novel Visual and Statistical Image Features for Microblogs News Verification",
IEEE Transactions on Multimedia, 2017
26 Detection and visualization of misleading content on Twitter.
27 Christina Boididou, Symeon Papadopoulos, Markos Zampoglou, Lazaros Apostolidis, Olga Papadopoulou, Yiannis Kompatsiaris.
28 https://www.analyticsvidya.com
29 https://www.ritchieng.com/machine-learning-evaluate-classification-model/
30 https://machinelearningmastery.com/
31 Automatically Identifying Fake News in Popular Twitter Threads
32 “Liar, Liar Pants on Fire”: A New Benchmark Dataset for Fake News Detection,
33 William Yang Wang
34 https://github.com/likeaj6/FakeBananas
35 https://github.com/nishitpatel01/Fake_News_Detection
36 K.Venkata Rao B.Keerthana,” Sales Prediction on Video Games Using Machine Learning” Journal of Emerging Technologies and Innovative
Research, Vol.6, pg.326-331, 2019.
37 Song Feng, Ritwik Banerjee, and Yejin Choi. Syntactic stylometry for deception detection. In ACL'12.
38 “Buzzfeednews:2017-12-fake-news-top-50,”https://github.com/BuzzFeedNews/2017-12- fake-news-top-50.

39 N. J. Conroy, V. L. Rubin, and Y. Chen, “Automatic deception detection: methods for finding fake news,” Proceedings of the Association for
Information Science and Technology, vol. 52, no. 1, 2015, pp. 1–4.
40 K. Leela Prasad, P. Anusha, M. Srinivasa Rao, K. Venkata Rao.(2019),” A Machine Learning based Preventing the Occurrence of Cyber Bullying
Messages on OSN”, International Journal of Recent Technology and Engineering,8(2),pp. 1861-1865.
41 V. L. Rubin, Y. Chen, and N. J. Conroy, “Deception detection for news: three types of fakes,” Proceedings of the Association for Information Science
and Technology, vol. 52, no. 1, 2015, pp. 1–4.
42 Dongping Tian et al. A review on image feature extraction and representation techniques. International Journal of Multimedia and Ubiquitous Engineering, 8(4):385–396, 2013.
43 Local tampering detection in video sequences Paolo Bestagini, Simone Milani, Marco Tagliasacchi, Stefano Tubaro
