An Enhanced Method For Detecting Fake News
ABSTRACT
The problem of fake news has grown rapidly in recent years, and social media has dramatically amplified its reach and impact. On one hand, social media's low cost, convenient accessibility and rapid sharing of information attract more and more people to read news there. On the other hand, the same properties enable the wide spread of fake news, which is nothing but false information meant to deceive people. As a result, automating fake news detection has become essential to keep online and social media trustworthy. AI and machine learning are the latest technologies applied to recognize and get rid of fake news with the help of algorithms.
In this work, machine-learning techniques are employed to assess the credibility of news based on the textual content of articles and on the responses given by users. A comparison is made to show that the latter is more dependable and effective for identifying all kinds of news. The approach used in this work relies on the posterior probability of tokens in the responses across the two classes. It uses frequency-based features to train algorithms such as Support Vector Machine, Passive Aggressive Classifier, Multinomial Naïve Bayes, Logistic Regression and Stochastic Gradient Classifier. This work also highlights a wide range of recent findings in this area, which gives a clearer picture of how this problem can be automated.
I have carried out an experiment in this work that matches lists of fake-related words against the text of the responses, to find out whether response-based detection is a good measure for deciding credibility. The results are very promising and leave scope for further research in the area. Linear SVM and the Stochastic Gradient Classifier with Tf-Idf vectors achieved accuracy above 90% and ROC Area Under Curve above 95%. This work can serve as a significant building block for determining the veracity of fake news.
Key words: TFIDF: Term Frequency-Inverse Document Frequency, SVM: Support Vector Machine, SGD: Stochastic Gradient Descent, PAC: Passive Aggressive Classifier, TP: True Positive, FP: False Positive, TN: True Negative, FN: False Negative.
I. INTRODUCTION
With the advancement of technology, information is freely accessible to everyone. The Internet provides an enormous amount of data, but the credibility of that data depends upon many factors. An enormous amount of content is published daily online and through other media, and it is tough to tell whether a given piece of information is true or false. Doing so requires a deep study and analysis of the story, which includes checking the facts by assessing the supporting sources, finding the original source of the information, or checking the credibility of the authors. Fabricated information may be a deliberate attempt to damage or favor the reputation of an organization, entity or individual, or it may simply serve financial or political gain []. "Fake News" is
the term coined for this type of fabricated information, which misleads people. During the Indian election campaigns, we find many such fabricated posts, news articles and morphed pictures circulating on social media.
In recent years, a substantial amount of research has been conducted in this area with satisfactory results. With the success and growth of AI and machine learning, technology has relieved humans of much extraneous effort. Fake news detection using these technologies can save society from unnecessary chaos and social unrest.
In this paper, we discuss how to build a classifier that is able to predict whether a user's claim is fake or real, using machine-learning algorithms and natural-language processing techniques. Machine learning is a subset of AI in the field of computing that often uses statistical techniques to give computers the ability to learn from data without being explicitly programmed [3]. Natural-language processing is an area of computing and AI concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural-language data [4].
One of the earlier works [5] was based on text classification of an article's body and headlines. The drawback of this approach is that tokens determined to have higher posterior probability in the two classes cannot necessarily be categorized as important words of those classes, because fake news can be well written using tokens that appear as important ones in the Real class. Hence, a more effective approach is to apply the higher-posterior-probability criterion to the responses given by users instead of the article's body.
Social media is used for rapidly spreading false news these days. A famous quote attributed to Winston Churchill goes: "A lie gets halfway round the world before the truth has a chance to get its pants on." With a large base of active users on social media, rumors and fake stories spread like wildfire. Responses to such news can prove to be a clincher in terming the news "fake" or "real": users provide evidence in the form of multimedia or web links to support or deny a claim. Classification based on this approach would be a significant step in this direction. To support this argument, I performed an experiment related to the occurrence of fake-related words in the collection of responses; Section 6.2.2 discusses this experiment.
Deaths are frequently caused by fake news, and people have been physically attacked over fabricated stories spread on social media. In Myanmar, Rohingya people were arrested, jailed, and in some cases even raped and killed because of fake news [9]. These incidents have created real-world fears and have affected civic engagement and community conversations.
probably be false. News published or broadcast by an unknown media house or newspaper can possibly be fake news, but these factors do not give assurance; the definitions and types of fake news must therefore be properly understood and categorized.
Many researchers have streamlined the types of fake news to simplify their research. For instance, according to the definitions given by [1], there are a few sorts of news that cannot be called "fake": (1) satire news with a proper context; (2) misinformation that is created unintentionally; (3) conspiracy theories that are difficult to place in a true/false dichotomy. Kai Shu has presented two main aspects of the fake news detection problem: "characterization" and "detection".
People who tend to believe their perception of reality is the only accurate view can take fake news to be true. They think that those who disagree with them are biased and irrational [15]. Also, people who prefer to receive news that confirms their existing beliefs and views are mostly biased [16], while others are socially conscious and choose the safer side while consuming and sharing news, following the norms of the community, even if the news shared is fake. These psychological and social human behavioral patterns are the two main foundations of fake news in traditional media. Along with these two factors, malicious Twitter bots are a foundation of fake news in social media [1].
Style Based Detection: Style-based detection focuses on the way the content is presented to the users. Fake news is usually not written by journalists, so the style of writing may differ [9]. Song Feng implemented deep syntax models using PCFG (Probabilistic Context-Free Grammars) to transform sentences into rules, such as lexicalized/unlexicalized production rules and grandparent rules, which describe the syntactic structure used for deception detection. William Yang Wang implemented deep network models, Convolutional Neural Networks (CNN), to determine the veracity of stories. Fake articles sometimes show extreme behavior in favor of a party. This sort
of writing style is called hyper-partisan style [39]. Linguistic features can be applied to detect this kind of writing. In a number of article headlines, there is only enough information to make readers curious enough to go to a particular website or video. This sort of eye-catching headline or web link is called a click-bait headline [1], which is a source of fake news.
Style-based methods also cover methods that find tokens with higher posterior probability in the two classes, using word-embedding features. Sohan Mone used the Naïve Bayes algorithm to obtain the tokens found to be most indicative for the classification and used them for deep learning and logistic regression. They combined the hypotheses obtained from Naïve Bayes, SVM and Logistic Regression and observed an average accuracy of 83% on their training set. Although writing style can contribute a great deal to detecting fake news, it tends to be less efficient because fake news can be written in a style similar to that of real news [10].
Stance based detection: This method compares how a series of posts on social media or a group of reputable sources feels about a claim: Agree, Disagree, Neutral or Unrelated. In [10], the authors used lexical as well as similarity features fed through a Multi-Layer Perceptron (MLP) with one hidden layer to detect the stance of the articles. They hard-coded a reputation-score feature (Table 2.1) for various sources based on nationwide research studies. Their model achieved 82% accuracy for pure stance detection on their dataset. Martin Potthast used a "wisdom of crowd" feature to improve news verification by discovering conflicting viewpoints in microblogs with the help of a topic-model method, Latent Dirichlet Allocation (LDA). Their overall news-veracity accuracy reached up to 84%.
Visual Based Detection on Social Media: Digitally altered images circulate everywhere on social media like wildfire. Photoshop can be used freely these days to modify images well enough to fool people into thinking they are seeing the real picture. The field of multimedia forensics has produced a substantial number of methods for tampering detection in videos [40] and pictures. There are also a few basic techniques on the web that let the general public identify photoshopped images, e.g. Google's reverse image search or image-metadata viewers. Andrew Ward has extracted many visual and statistical features (shown in Table 2.1) that can be used for detecting the authenticity of multimedia.
Other related works: [3] implemented a document similarity analysis that calculates the Jaccard similarity, a widely used similarity measure, between a news item in the test set and every news item in the fake news training set 'F' and the real news training set 'R'. The results obtained were very promising. Raymond S. Nickerson exploited the diffusion patterns of information to detect hoaxes. Many research papers have used different linguistic and word-embedding features; the most common ones are tf-idf, word2vec, punctuation, n-grams and PCFG.
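As a rough sketch of that similarity comparison (the tokenization and the final decision rule are not specified in the source, so both are illustrative assumptions here):

```python
def jaccard(doc_a: str, doc_b: str) -> float:
    """Jaccard similarity between the word sets of two documents."""
    a, b = set(doc_a.lower().split()), set(doc_b.lower().split())
    return len(a & b) / len(a | b)

def nearest_similarities(test_doc: str, F: list[str], R: list[str]) -> tuple[float, float]:
    """Highest similarity of a test article against the Fake set F and the Real set R."""
    return (max(jaccard(test_doc, f) for f in F),
            max(jaccard(test_doc, r) for r in R))
```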
These datasets are widely used in different research papers for determining the veracity of news. In the following sections, I briefly discuss the sources of the datasets used in this work.
The original dataset contained 13 columns for the train, test and validation files. The training set included 12,386 human-labeled short statements, sampled from news releases, TV or radio interviews, campaign speeches etc. The data was collected from the fact-checking website PolitiFact through its API.
For the implementation of the first phase, I chose only the training-set file and a couple of columns from this file for classification. The other columns can be added later to enhance the performance.
Classes (labels) were grouped such that the newly created dataset has only two labels (True/False), as compared to the six present in the original. They were grouped as below; a code sketch of the mapping follows the list.
True -- True
Mostly-true -- True
Half-true -- True
Barely-true -- False
False -- False
Pants-fire -- False
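A minimal sketch of that grouping as a lookup table (the file path, separator and label column are hypothetical, since the exact file layout is not given here):

```python
import pandas as pd

# Map the six original labels onto the two labels of the new dataset.
LABEL_MAP = {
    "true": "True", "mostly-true": "True", "half-true": "True",
    "barely-true": "False", "false": "False", "pants-fire": "False",
}

df = pd.read_csv("train.tsv", sep="\t", header=None)    # hypothetical path and layout
df["binary_label"] = df[1].str.lower().map(LABEL_MAP)   # assumes column 1 holds the label
```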
The second dataset consisted of four columns:
i). URL
ii). Headline
iii). Body
iv). Label
This dataset contained 4,335 news articles with long body text, as compared to the short texts in the previous dataset; the average word count of the body was 576 words per article. The label was encoded as 0/1: 0 for fake news and 1 for real news. Classification models were trained on this dataset, their performance was compared and the best model was chosen. After analyzing this dataset, I found that the fake news was mainly taken from a few international fake-news websites such as beforeitsnews.com, dailybuzzlive.com and activistpost.com, while the real news was covered from a few leading outlets such as reuters.com.
Fake news collection: I used Indian fact-checking websites for this purpose. AltNews.com, Smhoaxslayer.com and Boomlive.com are some of the agencies that are authentic and recognized for busting fake news [BBC]. I analyzed the articles they posted debunking fake news, looking only for the data relevant to the construction of the dataset. The relevant data were mainly tweets and Facebook posts by different users that had been busted as fake by the fact-checking agencies. All the URLs of the fake Twitter and Facebook posts were collected in the initial phase.
Real news collection: This was the easier task. I gathered posts and tweets of a few reputed news agencies, media news journalists and even some verified users and groups. I picked news that carried strong sentiments (negative
as well as positive), seeking higher attention, but was real. Thus the dataset created held a resemblance between fake and real news in terms of attracting attention. This was in fact a significant step for measuring the performance of the model, because responses to news with negative sentiment can make users believe that it is fake.
In total, 132 news items were collected for the dataset, of which 69 were classified as fake news and 63 as real news. I intentionally chose to keep the number of news items small but gathered a sizable number of responses for each item, picking only those posts that had received a considerable amount of responses. The dataset consisted of 5 columns.
For each URL of the posts collected, I extracted the comments for the respective post using the web-scraping tools in Python: Selenium and Beautiful Soup. With Selenium, we can extract the server version of the page content; the Beautiful Soup library on its own cannot do this, because it scrapes data from the client version of the page. Therefore, Selenium together with Beautiful Soup was used to scrape the required data.
I chose the first five to six pages of loaded comments, to keep the text neither too long nor too short. For convenience, the language of the responses collected was restricted to English. Facebook has a function called "Translate all" that converts all the comments to English in one go; on Twitter, any non-English comments have to be translated one by one. Thus, I scraped comments that were in English or constructed using the English alphabet.
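A rough illustration of this scraping step is sketched below: Selenium loads and renders the post, Beautiful Soup parses the rendered HTML, and the comment texts are collected. The URL and the CSS selector are hypothetical placeholders; the real selectors for Facebook and Twitter pages are site-specific and change over time.

```python
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("https://example.com/some-post")          # placeholder post URL
html = driver.page_source                            # HTML after the page has rendered
driver.quit()

soup = BeautifulSoup(html, "html.parser")
comments = [node.get_text(strip=True)
            for node in soup.select("div.comment")]  # hypothetical comment selector
```

The scraped responses then went through the pre-processing steps below.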
1 Conversion to lower case: The first step was to transform the text into lower case, to avoid multiple copies of the same word. For example, when finding word counts, "Response" and "response" would otherwise be taken as different words.
2 Removal of punctuation: Punctuation does not carry much significance when treating text data, so removing it helps reduce the overall size of the text.
3 Stop-word removal: Stop-words are the most commonly occurring words in a corpus, e.g. a, the, of, on, at. They typically define the structure of a text rather than its context, and if treated as features they would result in poor performance. Therefore, stop-words were removed from the training data as part of the text-cleaning process.
4 Tokenization: This refers to dividing the text into a sequence of words or groups of words such as bigrams, trigrams etc. Tokenization was done so that frequency-based vector values could be obtained for these tokens.
5 Lemmatization: This converts each word into its root form. With the help of a vocabulary, it performs morphological analysis to pick out the root word. In this work, lemmatization was performed to improve the values of the frequency-based vectors.
Text pre-processing was an essential step before the data was ready for analysis: a noise-free corpus has a reduced sample space for features, thereby leading to increased accuracy. A sketch of these steps follows.
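A minimal sketch of the five steps, assuming NLTK for stop-words, tokenization and lemmatization (the source does not name the text-processing library used):

```python
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

for pkg in ("stopwords", "wordnet", "punkt"):
    nltk.download(pkg, quiet=True)

def preprocess(text: str) -> list[str]:
    text = text.lower()                                               # 1. lower case
    text = text.translate(str.maketrans("", "", string.punctuation))  # 2. drop punctuation
    tokens = nltk.word_tokenize(text)                                 # 4. tokenize
    stop = set(stopwords.words("english"))
    tokens = [t for t in tokens if t not in stop]                     # 3. remove stop-words
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(t) for t in tokens]                  # 5. lemmatize

print(preprocess("Why are you spreading this Fake article?"))
# ['spreading', 'fake', 'article']
```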
Feature generation
We can use text data to generate a variety of features such as word count, frequency of large words, frequency of unique words, n-grams etc. By creating a representation of words that captures their meanings, semantic relationships and the various contexts they are used in, we can enable the computer to understand text and perform clustering, classification etc. For this purpose, word-embedding techniques are used to convert text into numbers or vectors so that the computer can process them.
Word Embedding: A word-embedding format generally tries to map a word to a vector using a dictionary. The following frequency-based word-embedding vectors were used for training the data; they are also categorized as linguistic features.
Example: Let us consider three documents in a corpus C, i.e. D1, D2 and D3, containing the text below. The dictionary can be created from the unique words. The unique words identified are: [Rain, Heavy, Yesterday, Bad, Weather, London, Newspapers, Warned]
The count matrix represents the occurrence of every term in every document; here the count matrix M is 3 x 8. A column of the matrix M can be called the word vector for the corresponding word; the word vector for "Yesterday" is [1, 0, 1]. The count vector lists all words or tokens from the highest frequency in the corpus down to the lowest: for example, Rain and Heavy have the highest occurrence in the corpus, so they lead the glossary in the dictionary. This feature was used in the proposed method to give the machine-learning models a notion of which words social media users often use when they see a fake or a real news item.
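The sketch below reproduces this example with scikit-learn's CountVectorizer; the wording of D1 to D3 is reconstructed illustratively from the dictionary above, since the original document texts are not shown.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["heavy rain yesterday",                    # D1 (illustrative wording)
        "bad weather london",                      # D2
        "newspapers warned heavy rain yesterday"]  # D3

vectorizer = CountVectorizer()
M = vectorizer.fit_transform(docs)                 # count matrix, 3 x 8 here

print(vectorizer.get_feature_names_out())          # the 8 unique dictionary words
print(M.toarray())                                 # rows = documents, columns = word vectors
```

The column for "yesterday" comes out as [1, 0, 1], matching the word vector above.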
TF-IDF weight represents the relative importance of a term in a document and in the full corpus.
TF stands for Term Frequency: it measures how frequently a term appears in a document. Since document sizes vary, a term may appear more often in a long document than in a short one; the term frequency is therefore often divided by the document length.
IDF stands for Inverse Document Frequency: a word is not of much use if it is present in all the documents. Certain terms like "a", "an", "the", "on", "of" etc. appear repeatedly in a document but are of little importance. IDF weighs down the importance of these terms and increases the importance of rare ones; the higher the IDF value, the more unique the word.
TF-IDF, Term Frequency-Inverse Document Frequency, is the product of the two. It works by penalizing the most commonly occurring words, assigning them less weight, while giving high weight to terms that are present in only a proper subset of the corpus and occur often in a particular document.
TF-IDF is a widely used feature for text classification. In addition, TF-IDF vectors can be calculated at different levels, i.e. word level and n-gram level, both of which I have used in this project (a sketch follows the list below):
i). Word-level TF-IDF: calculates the score for every single term in the different documents.
ii). N-gram-level TF-IDF: calculates the score for combinations of N terms together in the different documents.
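A minimal sketch of both levels with scikit-learn's TfidfVectorizer; the response texts are illustrative stand-ins for the scraped responses.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

responses = ["this is fake news, please fact check",   # illustrative response texts
             "photoshopped image, clearly a hoax",
             "great reporting as always"]

word_tfidf  = TfidfVectorizer(analyzer="word", ngram_range=(1, 1))  # i). word level
ngram_tfidf = TfidfVectorizer(analyzer="word", ngram_range=(1, 3))  # ii). unigrams to trigrams

X_word  = word_tfidf.fit_transform(responses)    # one column per single term
X_ngram = ngram_tfidf.fit_transform(responses)   # columns for 1-, 2- and 3-grams
print(X_word.shape, X_ngram.shape)
```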
Algorithms used for classification:
This section deals with training the classifier. Different classifiers were investigated to predict the category of the text. I explored five different machine-learning algorithms: Multinomial Naïve Bayes, Passive Aggressive Classifier, Logistic Regression, Linear Support Vector Machine and Stochastic Gradient Descent. These classifiers were implemented using the Python library Scikit-Learn.
Passive Aggressive Classifier: The Passive Aggressive algorithm is an online algorithm, ideal for classifying massive streams of data (e.g. Twitter). It is easy to implement and very fast: it works by taking an example, learning from it and then throwing it away.
Logistic Regression: Logistic regression is a classification algorithm used to predict the probability of occurrence of an event (0/1, True/False, Yes/No). It uses the sigmoid function to estimate probabilities.
Support Vector Machine: In this algorithm, each data item is plotted as a point in n-dimensional space (n is the number of features), with the value of each feature being the value of the corresponding coordinate. The algorithm then extracts the best possible hyperplane, or set of hyperplanes, in this high-dimensional space that segregates the two classes. A linear kernel was used for the SVM in this work.
Stochastic Gradient Descent: An SGD algorithm starts at a random point and updates the cost function at each iteration using one data point at a time, building a classifier with progressively higher accuracy given a large dataset. In SGD, a sample of the training set, or a single training value, is used to calculate the parameters, which is much faster than other gradient-descent variants.
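A sketch of the five classifiers with Scikit-Learn, as named above; the toy texts, labels and near-default hyper-parameters are illustrative only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import (LogisticRegression,
                                  PassiveAggressiveClassifier, SGDClassifier)
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

texts  = ["clearly a hoax", "fake photoshopped image", "please fact check this",
          "good article", "great reporting", "well written piece"]
labels = ["Fake", "Fake", "Fake", "Real", "Real", "Real"]   # toy labels

vec = TfidfVectorizer()
X = vec.fit_transform(texts)

models = {
    "Multinomial Naive Bayes":     MultinomialNB(),
    "Passive Aggressive":          PassiveAggressiveClassifier(),
    "Logistic Regression":         LogisticRegression(),
    "Linear SVM":                  LinearSVC(),      # linear kernel, as used in this work
    "Stochastic Gradient Descent": SGDClassifier(),
}
for name, model in models.items():
    model.fit(X, labels)
    print(name, model.predict(vec.transform(["is this photoshopped fake news?"]))[0])
```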
Classification Accuracy: This is the most common evaluation metric for classification problems. It is defined as the number of correct predictions against the total number of predictions. However, this metric alone cannot give enough information to decide whether a model is a good one or not; it is suitable when there are equal numbers of observations in every class.
Area under ROC curve: The area under the ROC curve is a performance metric used for binary classification. It reflects a model's ability to discriminate between the two classes: an AUC of 1.0 means the model has made all predictions correctly, whereas an AUC of 0.5 is as good as random predictions. ROC can be further broken down into sensitivity and specificity, and a binary classification problem is a trade-off between these two factors. Sensitivity: also called "Recall", this is defined as the number of instances from the positive class that are actually predicted correctly, a quantity known as the True Positive Rate. In this work, "Fake" was selected as the positive class and "Real" as the negative one.
Specificity: the number of instances in the negative class that are actually predicted correctly; it is also called the True Negative Rate.
Confusion Matrix: Also known as an error matrix, this is a tabular representation of the performance of the model. It is a special kind of contingency table with two dimensions, "actual" labels on one axis and "predicted" labels on the other; the cells of the table hold the number of predictions made by the algorithm for each combination.
Classification Report: Scikit-learn provides a convenient report for classification problems that outputs precision, recall, F1 score and support for every class.
Precision: Precision is the ratio of correctly predicted positive instances to the total predicted positive instances. High precision means a low false-positive rate.
Recall (Sensitivity): Recall is the ratio of correctly predicted positive instances to all the instances actually in the positive class.
F1-Score: This is the weighted average of precision and recall, so it takes into account both false positives and false negatives. The F1 score is usually more useful than accuracy, especially when there is an uneven class distribution. Accuracy performs best when false positives and false negatives have similar counts or costs; if their costs differ widely, it is better to look at both precision and recall.
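A sketch of these metrics computed with scikit-learn; the labels, predictions and decision scores below are illustrative, with "Fake" as the positive class.

```python
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score)

y_true = ["Fake", "Fake", "Real", "Real", "Fake"]   # illustrative ground truth
y_pred = ["Fake", "Real", "Real", "Real", "Fake"]   # illustrative predictions
scores = [0.9, 0.4, 0.2, 0.3, 0.8]                  # decision scores for "Fake"

print(accuracy_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred, labels=["Fake", "Real"]))  # [[TP, FN], [FP, TN]]
print(classification_report(y_true, y_pred))
print(roc_auc_score([1 if y == "Fake" else 0 for y in y_true], scores))
```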
K-fold cross-validation was used for splitting the dataset randomly into k folds: (k-1) folds were used for building the model while the k-th fold was used to check its effectiveness. This was repeated until each of the k folds had served as the test set. I used 3-fold cross-validation for this experiment, where 67% of the data is used for training the model and the remaining 33% for testing.
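A minimal sketch of that 3-fold scheme with scikit-learn's cross_val_score, on illustrative toy data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

texts = ["fake photoshopped image", "hoax alert", "fact check this",
         "great reporting", "good article", "well written"]
y = ["Fake", "Fake", "Fake", "Real", "Real", "Real"]   # toy labels

X = TfidfVectorizer().fit_transform(texts)
print(cross_val_score(LinearSVC(), X, y, cv=3, scoring="accuracy"))  # one score per fold
```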
Responses were classified using the Count vector and the Tf-Idf vector at two levels:
Word level – a single word was chosen as the token for this experiment.
N-gram level – I kept the n-gram range from 1 to 3, i.e. from one word to at most 3 words (bigram, trigram), as the token, and the experiment was performed.
Maximum document frequency was also used in this experiment as a parameter of the Tf-Idf vector. This parameter removes all tokens that appear in more than X% of the responses. Initially the parameter was not set, i.e. no tokens were filtered; later X was increased in steps of 0.1, i.e. 10%, and the results were noted down.
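A sketch of that sweep with TfidfVectorizer's max_df parameter, on illustrative texts; max_df=1.0 leaves every token in, and lowering it drops the tokens that appear in more than that fraction of the responses.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

responses = ["fake news again", "fake hoax image", "fake and photoshopped",
             "nice article", "good reporting"]              # illustrative texts

for x in [1.0, 0.9, 0.7, 0.5]:
    vec = TfidfVectorizer(max_df=x)        # drop tokens seen in more than x of the docs
    features = vec.fit_transform(responses)
    print(x, features.shape[1])            # vocabulary shrinks once "fake" (df = 0.6) is cut
```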
As the tables above show, classification accuracy at the word level was better than at the n-gram level. The accuracy of Multinomial Naïve Bayes with Tf-Idf at the n-gram level was the lowest at 77.3%, while Linear SVM, Stochastic Gradient Descent and the Passive Aggressive Classifier using Tf-Idf vectors performed well at both levels, with accuracies above 90%. Since classification accuracy alone is not sufficient to determine the effectiveness of a model, other metrics were also explored, especially for these three algorithms at the word level using Tf-Idf vectors. In another experiment, I included the maximum-document-frequency parameter described above. As the parameter was increased from 0 to 1 in steps of 0.1, the classification accuracy of the three models increased significantly, as depicted in the table below.
The best performing model was Linear SVM with 93.2% at a maximum document frequency of X = 0.7, closely followed by Stochastic Gradient Descent and the Passive Aggressive Classifier with 92.4%. Beyond 0.7 the algorithms did not show improvement, so the maximum document frequency was kept at X = 0.7.
I then obtained the classification reports, including precision, recall and F1-score, of all three models at X = 0.7.
Classification Error: This expresses how often the model is wrong overall; it is also called the misclassification rate.
The precision of Linear SVM-TFIDF, at 94%, is above that of SGD-TFIDF at 93%, while the recall (sensitivity) was calculated as 92% for both models.
Confusion-matrix counts per cross-validation split.
Linear SVM-TFIDF: TP = 16, TN = 24, FP = 1, FN = 3; TP = 24, TN = 18, FP = 1, FN = 1; TP = 24, TN = 17, FP = 2, FN = 1.
SGD-TFIDF: TP = 16, TN = 24, FP = 1, FN = 3; TP = 24, TN = 16, FP = 3, FN = 1.
Sensitivity tells how sensitive the classifier is in detecting fake news, while specificity tells how selective or specific the model is in predicting real news. Which metric to favor depends on the kind of application being developed. The positive class in this binary classification is the class "Fake"; therefore, sensitivity should be higher, because false positives are more acceptable than false negatives in classification problems of this kind. The sensitivity is high for both models and has equal value; by optimizing further for sensitivity, we can get better results.
By decreasing the threshold for predicting fake news, we can increase the sensitivity of the classifier; this would increase the number of true positives. In this work, the threshold is set to 0.5 by default, but we can adjust it to increase sensitivity or specificity depending on what we want.
The ROC curve is a way to see how various thresholds affect sensitivity and specificity, without actually changing the threshold. A sketch of the threshold adjustment itself follows.
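One way to realize the threshold adjustment in code, sketched with logistic regression because it exposes predict_proba (LinearSVC would use decision_function instead); the texts and the 0.3 cut-off are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts  = ["total hoax", "fake photoshopped image", "check the facts",
          "good article", "great reporting", "nice piece"]
labels = ["Fake", "Fake", "Fake", "Real", "Real", "Real"]    # toy data

vec = TfidfVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(texts), labels)

X_test = vec.transform(["is this another fake hoax?"])
fake_col = list(clf.classes_).index("Fake")
proba_fake = clf.predict_proba(X_test)[:, fake_col]   # probability of class "Fake"

print(np.where(proba_fake >= 0.5, "Fake", "Real"))    # default 0.5 threshold
print(np.where(proba_fake >= 0.3, "Fake", "Real"))    # lower threshold: more True
                                                      # Positives, higher sensitivity
```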
[Figure: ROC curves, Linear SVM (for three splits).]
The AUC is the percentage of the ROC plot that lies underneath the curve; the higher the AUC, the better the classifier. AUC is very useful when there is a high imbalance of classes. The AUC scores for the models are shown below:
LinearSVM-TFIDF_ROC-AUC-SCORE: 96.0%
LinearSVM-TFIDF_CROSS-VAL-SCORE: 97%
From the above experiments and results, it was concluded that the Linear SVM algorithm, using the Term Frequency-Inverse Document Frequency vector (word level) at a maximum document frequency of 0.7, gave the best performance. It was therefore chosen as the best model to determine the veracity of the news.
Other Experiments:
Content (the body of the article) was classified using the Count vector and the Tf-Idf vector on datasets with varying lengths of text content.
The experiment was performed on the two publicly available datasets: the first dataset contained more news items but short texts, while the second contained fewer news items and long texts. Accuracy was noted for both, and it was found that the accuracy the models gave on the second dataset was higher than on the first.
An experiment counting the number of fake-related words, or combinations of two or more such words, in the responses was also performed.
This was the most useful experiment, as it showed that response-based detection has a significant advantage over text-based detection on the article's body. In this experiment, I calculated the frequency of words signifying fakeness, e.g. "Fake news", "Misinformation", "Hoax", "Photo-shopped" etc., in the responses collected. The general idea is that if more such words are used in a response, then the news has a high probability of being fake; if no such words are present, then the article is most likely a true one.
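A sketch of that counting step; the lexicon below just repeats the examples from the text and is not the full list used in the experiment.

```python
import re

FAKE_TERMS = ["fake news", "misinformation", "hoax", "photo-shopped", "photoshopped"]

def fake_word_count(responses: list[str]) -> int:
    """Count occurrences of fakeness terms in the pooled responses of one news item."""
    text = " ".join(responses).lower()
    return sum(len(re.findall(re.escape(term), text)) for term in FAKE_TERMS)

print(fake_word_count(["This is clearly photoshopped", "Stop spreading misinformation!"]))
# 2: one "photoshopped", one "misinformation"
```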
Another very useful experiment, performed at the end, finds the most informative features/tokens in the collection of responses that affect the news veracity (fake/real).
Most informative tokens for SVC-TFIDF: the image below shows the top 30 tokens for the 3 splits of the dataset, sorted by their TF-IDF weights.
As is clear from the above images, the word "fake" has the largest negative tf-idf weight of -1.85 in the Fake class, a trivial response by users in the case of fake claims. Some other important words are:
1 Tweet: Words like "False tweet" or "Misleading tweet" are used in response to fake tweets.
2 Altnews: This word has a tf-idf weight of -0.403 and a high influence in determining fake news. Altnews.com is a fact-checking agency that busts fake news circulating on social media. Users in their responses refer to the articles of such agencies debunking the fake claims; therefore this word appears within the top 30 important tokens. Other popular fact-checking agencies are Smhoaxslayer.com, Snopes.com, Boomlive.com, Politifact.com etc.
3 Check: This word is used in responses when people ask the tweeter to fact-check before tweeting, or in sentences like "Please check the facts before posting it."
4 Theonion: It has a tf-idf weight of 0.376; it is a popular satirical news website.
5 Photo shopped, Photoshop: For any morphed/modified images circulating on social media, users term them photoshopped images in the responses; therefore these tokens have high influence.
6 Images: Used with words like "Fake images" or "Photoshopped images".
7 Spread/Spreading: Sentences like "Please don't spread misinformation." or "Why are you spreading this fake article?" appear mostly in comments.
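A sketch of how such a ranking can be read off a fitted linear model: each coefficient of the Linear SVM weights one TF-IDF feature, and with "Fake" as classes_[0] the strongly negative weights pull toward Fake while the strongly positive ones pull toward Real. The toy corpus is illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

texts  = ["fake photoshopped hoax", "altnews busted this fake claim",
          "check the facts please", "good article", "great reporting",
          "well researched piece"]
labels = ["Fake", "Fake", "Fake", "Real", "Real", "Real"]   # toy data

vec = TfidfVectorizer()
svm = LinearSVC().fit(vec.fit_transform(texts), labels)

names = vec.get_feature_names_out()
order = np.argsort(svm.coef_.ravel())          # most negative first -> "Fake" side
print("Fake-indicative tokens:", names[order[:5]])
print("Real-indicative tokens:", names[order[-5:]])
```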
VII. CONCLUSION
Users' opinions on social media posts can be effectively applied to determine the veracity of news. The dissemination of fake news on social media is extremely fast, and thus this method can serve as a basic building block for fake news detection. With the highest classification accuracy of 93.2%, sensitivity of 92% and ROC AUC score of 97%, the Linear Support Vector Machine with Tf-Idf vectors served as a better model compared to the others. In this work, the classification was performed on a small number of news items; adding more data to the dataset will test the consistency of the performance, thereby increasing the trust of users in the system. In addition, gathering real news that closely resembles fake news will improve the training of the model, and more linguistic features can be applied to the responses to determine the news veracity. Social media plays a crucial role in the news-verification process; however, if the news is recent and is published in only a few news outlets at the beginning, then social media cannot be used as an additional resource. The ongoing shift from traditional media to social media and the fast dissemination of news checks this limitation. Therefore, by exploring more social media features in our experiments and combining them, we can create an efficient and reliable system for detecting fake news.
REFERENCES
1 Kai Shu, Amy Sliva, Suhang Wang, Jiliang Tang, Huan Liu. "Fake News Detection on Social Media", ACM SIGKDD Explorations Newsletter, 2017
2 https://en.wikipedia.org/wiki/Fake_news
3 https://en.wikipedia.org/wiki/Machine_learning
4 https://en.wikipedia.org/wiki/Natural_language_processing
5 Sohan Mone, Devyani Choudhary, Ayush Singhania. Fake News Identification. CS 229: Machine Learning, Group 621.
6 Emilio Ferrara, Onur Varol, Clayton Davis, Filippo Menczer, and Alessandro Flammini. The rise of social bots. Communications of the ACM, 59(7):96-104, 2016.
7 Carlos Merlo (2017), "Millonario negocio FAKE NEWS", Univision Noticias
8 Chang, Juju; Lefferman, Jake; Pedersen, Claire; Martz, Geoff (November 29, 2016). "When Fake News Stories Make Real News Headlines".
Nightline. ABCNews.
9 https://www.cjr.org/analysis/facebook-rohingya-myanmar-fake-news.php
10 https://blog.paperspace.com/fake-news-detection/
11 Eni Mustafaraj and Panagiotis Takis Metaxas. The fake news spreading plague: Was it preventable? arXiv preprint arXiv:1703.06988, 2017.
12 Niall J Conroy, Victoria L Rubin, and Yimin Chen. Automatic deception detection: Methods for finding fake news. Proceedings of the Association for
Information Science and Technology.
13 Martin Potthast, Johannes Kiesel, Kevin Reinartz, Janek Bevendorff, and Benno Stein. A stylometric inquiry into hyperpartisan and fake news. arXiv preprint arXiv:1702.05638, 2017.
14 David O Klein and Joshua R Wueller. Fake news: A legal perspective. 2017.
15 Andrew Ward, L. Ross, E. Reed, E. Turiel, and T. Brown. Naive realism in everyday life: Implications for social conflict and misunderstanding. Values and Knowledge, pages 103-135, 1997.
16 Raymond S. Nickerson. Confirmation bias: A ubiquitous phenomenon in many guises. Review of General Psychology, 2(2):175, 1998.
17 Alessandro Bessi and Emilio Ferrara. Social bots distort the 2016 us presidential election online discussion. First Monday, 21(11), 2016
18 Michele Banko, Michael J. Cafarella, Stephen Soderland, Matthew Broadhead, and Oren Etzioni. Open information extraction from the web. In IJCAI'07.
19 Amr Magdy and Nayer Wanas. Web-based statistical fact checking of textual documents. In Proceedings of the 2nd International Workshop on Search and Mining User-Generated Contents, pages 103-110. ACM, 2010.
20 Giovanni Luca Ciampaglia, Prashant Shiralkar, Luis M. Rocha, Johan Bollen, Filippo Menczer, and Alessandro Flammini. Computational fact checking from knowledge networks. PLoS ONE, 10(6):e0128193, 2015.
21 You Wu, Pankaj K. Agarwal, Chengkai Li, Jun Yang, and Cong Yu. Toward computational fact-checking. Proceedings of the VLDB Endowment, 7(7):589-600, 2014. [22] Baoxu Shi and Tim Weninger. Fact checking in heterogeneous information networks. In WWW'16.
22 https://www.huffingtonpost.in/2018/04/25/facebook-says-its-fact-checkers-will-stop-fake-news-in-the-karnataka-election-well-just-have-to-believe-them_a_23420278/
23 Christina Boididou, Symeon Papadopoulos, Markos Zampoglou, Lazaros Apostolidis, Olga Papadopoulou, Yiannis Kompatsiaris. "Detection and
visualization of misleading content on Twitter", International Journal of Multimedia Information Retrieval, 2017
24 Cody Buntain, Jennifer Golbeck. "Automatically Identifying Fake News in Popular Twitter Threads", 2017 IEEE International Conference on Smart
Cloud (SmartCloud), 2017
25 Zhiwei Jin, Juan Cao, Yongdong Zhang, Jianshe Zhou, Qi Tian. "Novel Visual and Statistical Image Features for Microblogs News Verification",
IEEE Transactions on Multimedia, 2017
26 Christina Boididou, Symeon Papadopoulos, Markos Zampoglou, Lazaros Apostolidis, Olga Papadopoulou, Yiannis Kompatsiaris. Detection and visualization of misleading content on Twitter.
28 https://www.analyticsvidya.com
29 https://www.ritchieng.com/machine-learning-evaluate-classification-model/
30 https://machinelearningmastery.com/
31 Automatically Identifying Fake News in Popular Twitter Threads
32 William Yang Wang. "Liar, Liar Pants on Fire": A New Benchmark Dataset for Fake News Detection.
34 https://github.com/likeaj6/FakeBananas
35 https://github.com/nishitpatel01/Fake_News_Detection
36 K. Venkata Rao, B. Keerthana. "Sales Prediction on Video Games Using Machine Learning", Journal of Emerging Technologies and Innovative Research, Vol. 6, pp. 326-331, 2019.
37 Song Feng, Ritwik Banerjee, and Yejin Choi. Syntactic stylometry for deception detection. In ACL'12.
38 "Buzzfeednews: 2017-12-fake-news-top-50," https://github.com/BuzzFeedNews/2017-12-fake-news-top-50.
39 N. J. Conroy, V. L. Rubin, and Y. Chen, “Automatic deception detection: methods for finding fake news,” Proceedings of the Association for
Information Science and Technology, vol. 52, no. 1, 2015, pp. 1–4.
40 K. Leela Prasad, P. Anusha, M. Srinivasa Rao, K. Venkata Rao (2019), "A Machine Learning based Preventing the Occurrence of Cyber Bullying Messages on OSN", International Journal of Recent Technology and Engineering, 8(2), pp. 1861-1865.
41 V. L. Rubin, Y. Chen, and N. J. Conroy, “Deception detection for news: three types of fakes,” Proceedings of the Association for Information Science
and Technology, vol. 52, no. 1, 2015, pp. 1–4.
42 Dongping Tian et al. A review on image feature extraction and representation techniques. International Journal of Multimedia and Ubiquitous Engineering, 8(4):385-396, 2013.
43 Paolo Bestagini, Simone Milani, Marco Tagliasacchi, Stefano Tubaro. Local tampering detection in video sequences.