Not Everything You Read Is True: Fake News Detection Using Machine Learning Algorithms
Abstract— This paper considers establishing whether a news article is true or whether it has been faked. To achieve this task accurately, the work compares different machine learning classification algorithms with different feature extraction methods. The combination of algorithm and feature extraction method giving the highest accuracy is then used to predict the labels of news headlines. In this work the algorithm shown to have the highest accuracy was logistic regression, with 71% accuracy when used with the tf-idf feature extraction method.

Keywords— Bag of words, natural language processing, machine learning, k-nearest neighbor, term frequency inverse document frequency.

I. INTRODUCTION

As our lives increasingly centre on online interaction, many of our tasks, such as seeking new information or news, have been directed to online social media services instead of traditional information sources like newspapers and radio. The reasons for this change are simple to identify: first, compared to traditional media, social media is less time consuming and less expensive; second, it has made sharing and putting forward an individual's opinion straightforward, just a single click. Indeed, social networks are used for multiple information and business purposes, apart from acting as an interaction tool. They are major competitors for other media outlets such as newspapers, radio and television. They are platforms that allow the general public to share and produce news, an activity once restricted to news agencies. It has been shown that in the US the number of people who rely on social media as a platform to gather news has increased rapidly in the past few years. In 2012, only 49 percent of people reported seeing news on social media, while now almost 70 percent of the population depends on social media[1]. Although social media offers many advantages over traditional media, the quality of information is often compromised[2]. Since it is easy to disseminate news online, passing on fake news for political or financial reasons is an emerging problem. During the 2016 US presidential election the term "fake news" came into the picture and has remained in common use since then. Over the last few years many examples have been seen where spreading such false information has become a business[3]. Information written using incorrect information, outdated facts, biased claims and errors all comes under this category. Many researchers and big companies like Facebook and Google are working on this challenging area of research to find solutions that stop the spread of such malicious information to the public. This research studies the detection of fake news, building a model that can contribute towards the initial step of categorizing a news context as fake or not fake.

II. HOW FAKE NEWS SPREADS

Recent studies have defined fake news as a type of propaganda that consists of deliberate misinformation or hoaxes spread via traditional print and broadcast news media or online social media. It is forged information that mimics traditional media but without the correct intention, lacking the accuracy and norms of properly published content. It is written to attack a group, person or agency, or to gain financial revenue and political attention, using false headlines to increase online sharing and readership. People today use mass media as a source of earnings by manipulating information in different ways. This leads to the production of news articles that are neither completely false nor completely true, which in turn has led to the spread of information with various degrees of truthfulness[3].

The origination of fake news is not recent; it first came into reports in the 16th century, when communication and information took the form of printed pamphlets and newsbooks. In 1654, a news pamphlet reported a monster in Catalonia with goat legs, a human body, and seven arms and seven heads. In 1611 a newsletter carried the story of a woman surviving for 14 years without food and water, a story that prevailed in that era. Later, as resources increased, so too did the spread of such disinformation. In 1835, reports suggest, news of "giant man-bats that spent their days collecting fruit and holding animated conversations" and a "goat-like creature with blue skin", supposedly witnessed by a British astronomer, caused a sensation that rapidly increased the sales of the newspaper which printed it. More modern, and increasingly dangerous, examples include cures for the COVID-19 virus which have no medical substance.

The digitization of news has overtaken traditional media by increasing the reach of fake news. It has provided a platform for non-journalists, too, to spread their views to the public, in a technique referred to as gonzo journalism. This is exacerbated by citizen or street journalism based on populist views. This increasing spread of fabricated information
Authorized licensed use limited to: Letterkenny Institute of Technology (LYIT). Downloaded on February 16,2022 at 15:27:51 UTC from IEEE Xplore. Restrictions apply.
among the audience has challenged traditional beliefs about news accuracy. Traditional news media has given way to social media sites like Twitter, Facebook and TikTok. It is difficult to distinguish between actual and forged sources. Since the sources are numerous, there is only apparent veracity in the news shared on social media. The question lies in the authenticity of the site and of the source of the information communicated.

Recently, the rise of the term "echo chamber" in relation to online media has emerged as an issue of concern. In general, an echo chamber is a situation where certain ideas or beliefs are repeated again and again within a closed chamber, without allowing alternative ideas or concepts to be reinforced[5]. It is basically a closed group sharing similar information that its members agree with. It exists in many forms, such as blogs, forums and social media sites. With social media acting as a large chamber for any information or ideas to be shared, regardless of whether they are true, false information gets echoed within the community of people connected on social media, thereby shaping public opinion with false beliefs[6].

III. TYPES OF FAKE NEWS

Based on the intention behind, and the type of, the misinformation, fake news has been categorized into the following classes:

• Clickbait
News that has no grounding in fact; its main goal is to generate revenue through large web traffic, that is, by generating clicks. Pop-up websites, such as those run by Macedonian teenagers, fall into this category. Short messages on networking sites that convince people to read more by clicking a link are a source of earnings, drawing the audience to a page containing ambiguous information.

• Propaganda
These news articles conflate fact and opinion and are nakedly supportive of one political party. They are misleading articles, sometimes created from official sources with a political motive, positioning themselves as alternatives to the opposition in mainstream media.

• Commentary/Opinion
Opinions are one of the biggest influencers of every human activity. We think, we judge, we have sympathies and we make choices based substantially on how other people perceive the reality we live in. News sites can use these tactics to influence people towards a political party, for example. The goal here is not revenue, but influence. Fabricated stories are often mixed with true or sensational ones. For example, outlets in Russia might produce content to swing public opinion, sow division and give the illusion of supporting a particular candidate.

• Satire/Humor
News stories written for entertainment often lead to the production of fake news, especially if they touch on current events or politics and appear free of context on social media. Such content generally mimics the style and format of journalistic reporting, which often misleads the inattentive reader and creates confusion with legitimate news. Readers who are unaware of the satirical nature of the news, or who lack the contextual or cultural background to interpret its fakeness, end up believing the fake news. Furthermore, on social media platforms like Facebook and Twitter, stories which are "liked" or "shared" all appear in a common visual format. Unless a user looks specifically for the source attribution, an article from The Onion looks just like an article from a credible source, like The New York Times. The Onion, Clickhole and the New Yorker's Borowitz Report are some examples of satirical news sites[7].

IV. MODEL CREATION

A. Data Collection
The data collection process followed in this research is secondary, as the data was gathered from online sources. To accomplish the task of fact checking news articles to see whether they are fake or not, labelled headlines were required. CSV files from two data sources were collected. The first file, factcheck.csv, was taken from the GitHub repository of a data science project. The file consisted of the following columns: US news headline, source, label, author, medium of broadcast and type of news. The dataset was filtered, and the data of interest was sieved from the file. The sieved data consisted of news headlines and a label of true, false, half-true, barely-true, mostly-true or pants-fire. These labels were further grouped into only two classes, true or false, by replacing half-true and mostly-true with true, and barely-true and pants-fire with false. The second file, a publicly available dataset, fake_or_real_news.csv, was taken from Kaggle, an online platform owned by Google LLC and built for data scientists and machine learning practitioners to share datasets and build data science models. The chosen dataset contained over 6000 news items, with 50% labelled true and the remaining 50% labelled fake. The file consisted of four columns: news id, news headline, text, and a label saying fake or real. The links for both datasets are given below.

Link 1:
https://github.com/UCSBdataScienceProjectGroup/fakeNewsPrediction/blob/master/factCheck.csv

Link 2:
https://www.kaggle.com/rchitic17/real-or-fake#fake_or_real_news.csv

B. Feature Extraction
In order for the machine to understand the data for analysis, it needs to be converted into a machine-readable format. Machine learning algorithms have made many human tasks easier by performing them much as humans do, but they still lack the capability to understand human language as it is spoken. Hence, this language needs to be converted into numerical content, which computers can interpret easily. To accomplish this, the Bag of Words (BoW) model is implemented with the machine learning algorithms. BoW is basically a way of extracting features from a document. The idea behind this model is to build a vocabulary of words from the text document and measure the occurrence of each word in that document. The model is only concerned with the occurrences of words, not with where they appear in the document. The Python scikit-learn library has made the implementation of this model easy by providing the different methods discussed below, each of which vectorizes the document with a different approach, as followed in this work.
• Count vectorizer
This is considered the simplest of the available vectorizers[8]. The count vectorizer simply counts the number of times each token appears and assigns this count as the weight of the token. This helps in building the vocabulary of the whole document, which can then be used for encoding other documents.

• Tf-idf vectorizer
Tf-idf, short for term frequency inverse document frequency, maintains the word count as well as an offset for each word in the document. The expression "term frequency" refers to the count of occurrences of a word in the document, which is directly proportional to the weight of the word[9]. The term "inverse document frequency" refers to the adjustment of the offset of words: words such as 'is' and 'the', which occur very frequently in the document, are assigned a smaller offset than other words.

• Hashing vectorizer
This vectorizer works like a hashing function in programming. Words are mapped to a fixed-size set of numbers in a hash space[10]. The hash space is chosen based on the vocabulary of the document. This vectorizer helps increase memory efficiency: tokens are no longer stored as strings but as indexes into the hash table.

C. Building Model
In order to classify news as fake or not fake, different machine learning classification algorithms are implemented, and the algorithm with the highest accuracy is saved as the model for the project. The classification algorithms implemented for this project are logistic regression, decision tree, k-nearest neighbor and random forest. The data is fed to each algorithm after the features have been extracted with the three vectorizers.

V. MODEL EVALUATION

The data is divided into 80 percent training data and 20 percent testing data for evaluating the different models. Implementing the four classification algorithms with the different vectorizers gave the results shown in figure 1. The logistic regression algorithm, implemented after extracting features with the tf-idf vectorizer, gave the highest accuracy of 71 percent when testing the model.

Fig. 1. Summarizing results

The confusion matrix, a performance measurement for machine learning algorithms, was plotted for the model as shown in figure 2. The columns of the matrix give the actual values while the rows give the predicted values. The matrix shows the following:

1. Out of 133 false or fake news headlines, the model succeeded in predicting 98 correctly as fake news; that is, almost 73 percent of the false news was correctly classified.

2. The true or not fake headlines numbered 164 in the test data, of which 115 were correctly classified as true headlines by the model built in this research.

Fig. 2. Confusion matrix for test data predicted

A. Evaluating the model with five true headlines
In order to test the model, five true news headlines from across the globe were taken and tested with the fake news detector model. Table 1 shows the headlines and the class predicted. When tested with headlines from the USA, UK and India, the model gave 2 out of 5 results correctly. Of the first three headlines, which are from the USA, it successfully detected two true headlines as true; however, it failed to classify the third correctly. The model failed when news headlines from other countries were tested.

News Headline | Class Predicted
Trump is the new President of USA | True
White house to focus on investment in middle east as part of peace proposal. | True
Katie Hill said: 'Old Men' should not make decision on abortion. | False
Brexit bill should include public vote. | False
India exit poll suggests Modi re-election. | False

Table 1. True headlines and label predicted
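The procedure described in the Building Model and Model Evaluation sections — extract features, train a classifier, and score on a held-out 20 percent — can be sketched end to end for the best-reported combination (tf-idf with logistic regression). The labelled headlines below are invented placeholders for the roughly 6000-item dataset actually used, so the printed accuracy will not match the reported 71 percent:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Invented labelled headlines standing in for the merged CSV data
headlines = [
    "president signs new trade agreement with allies",
    "scientists publish peer reviewed climate study",
    "government announces budget for public schools",
    "local council approves new hospital funding",
    "election results confirmed by independent observers",
    "miracle pill cures all diseases overnight",
    "celebrity secretly replaced by clone says insider",
    "aliens endorse candidate in upcoming election",
    "drinking bleach stops virus doctors hate it",
    "moon landing filmed in studio claims anonymous source",
]
labels = ["REAL"] * 5 + ["FAKE"] * 5

# 80/20 split as in the evaluation section (stratified so both
# classes appear in the training set even at this toy scale)
X_train, X_test, y_train, y_test = train_test_split(
    headlines, labels, test_size=0.2, stratify=labels, random_state=42)

vec = TfidfVectorizer()                  # best-performing extractor
X_train_vec = vec.fit_transform(X_train)
X_test_vec = vec.transform(X_test)       # reuse the training vocabulary

clf = LogisticRegression()               # best-performing classifier
clf.fit(X_train_vec, y_train)
pred = clf.predict(X_test_vec)

print(accuracy_score(y_test, pred))
print(confusion_matrix(y_test, pred))
```

Note that scikit-learn's confusion_matrix places actual classes on the rows and predicted classes on the columns, the opposite orientation to the matrix description given above.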
B. Evaluating the model with five false headlines
The model was further tested with fake headlines from across the globe. Table 2 shows that for fake news pertaining to the USA, the fake news detector classified the headline correctly as False, as seen in the results of the first and third news headlines in the table; however, it again failed to give the correct result for one of the USA fake headlines, owing to its limited accuracy. Interestingly, it gave a correct result for one of the fake headlines from Ireland but failed for the second-last headline.

News Headline | Class Predicted
Trump to begin war with Pakistan. | False
Michelle Obama to be the next President of USA | True
Trump to visit India this week. | False
Michel D. Higgins to meet trump this June. | True
Leo Varadkar refuses to meet Corbyn for Brexit talks. | False

Table 2. False headlines and label predicted

VI. CONCLUSION

The research was successful in building a fake news detector that classifies news headlines or text as fake or not fake by analysing labelled news headlines. This research focused on the study of different machine learning algorithms with various feature extraction models. Though prior studies in the area of fake news detection using machine learning are numerous, this work tries to collate them and compare the accuracy of different combinations of feature extractors. During the study it was observed that the accuracy of prediction when using machine learning algorithms on text data depends on both the feature extraction method and the algorithm selected. The faster and more efficient the extraction method, the more effectively it converts the text into a machine-readable format that can later be worked on with different algorithms. However, the best feature extraction model varies with the type of data fed to it; it is not necessarily the case that a particular model always gives the best results compared to others. The same principle was observed with the machine learning algorithms, whose accuracy varies with the training data. In this research, based on the training data provided, the logistic regression algorithm gave the highest accuracy compared with the other algorithms.

The research was thus successful in accomplishing the task of detecting fake news. With the present growing problem of fake news, there is a demand for research in this area to develop and improve technologies and build methods that can save the public from being trapped. There is a need to build systems that can alert the public, that people can actually rely on, and that can guide them along the right path. The major application area for fake news detectors is the online platform. On popular social media sites, where people share their thoughts or forwards from other sources, a fake news detector can read and classify text as fake or true at that instant, helping others to know the type of news shared. Apart from social media, the detector can also be used on online news websites: websites which publish news articles from untrusted sources can be judged as an article is published, protecting the public and avoiding chaotic situations in politics. Basically, the model can be very helpful in detecting fraudulent content present in text form online.

ACKNOWLEDGMENT

I would like to extend my gratitude to a few people, without whom this research wouldn't be possible.

I am very grateful to the Head of Department, Thomas Dowling, for his valuable time and guidance. I would also like to give special thanks to my supervisor, Ruth Lennon, who assisted me with her ideas and knowledge to come up with this work.

I want to thank all my lecturers who gave me excellent knowledge during my masters: Dr. James Connolly, Dr. Eoghan Furey, Edwina Sweeney, Dr. Mark Leeney and Dr. Michael McCann.

I would like to thank my parents, Dr. Vinay Tiwari and Dr. Vineeta Tiwari, for their love and support throughout the work. Lastly, I want to thank my friend Goutham Siddharth for his constant motivation.

REFERENCES

[1] H. Allcott and M. Gentzkow, 'Social Media and Fake News in the 2016 Election', Journal of Economic Perspectives, vol. 31, no. 2, pp. 211–236, May 2017.
[2] K. Shu, A. Sliva, S. Wang, J. Tang, and H. Liu, 'Fake News Detection on Social Media: A Data Mining Perspective', arXiv:1708.01967 [cs], Aug. 2017.
[3] A. Campan, A. Cuzzocrea, and T. M. Truta, 'Fighting fake news spread in online social networks: Actual trends and future research directions', in 2017 IEEE International Conference on Big Data (Big Data), 2017, pp. 4453–4457.
[4] 'What is an Echo Chamber? - Definition from Techopedia', Techopedia.com. [Online]. Available: https://www.techopedia.com/definition/23423/echo-chamber. [Accessed: 03-Jun-2019].
[5] K. Garimella, G. D. F. Morales, A. Gionis, and M. Mathioudakis, 'Political Discourse on Social Media: Echo Chambers, Gatekeepers, and the Price of Bipartisanship', arXiv:1801.01665 [cs], Jan. 2018.
[6] J. Titcomb and J. Carson, 'Fake news: What exactly is it – and how can you spot it?', The Telegraph, 14-Nov-2017.
[7] J. Brownlee, 'How to Prepare Text Data for Machine Learning with scikit-learn', Machine Learning Mastery, 28-Sep-2017.
[8] I. Arora, 'Document feature extraction and classification', Towards Data Science, 19-Mar-2017. [Online]. Available: https://towardsdatascience.com/document-feature-extraction-and-classification-53f0e813d2d3. [Accessed: 30-Apr-2019].
[9] J. Blanco, 'Hacking Scikit-Learn's Vectorizers', Towards Data Science, 15-Feb-2018. [Online]. Available: https://towardsdatascience.com/hacking-scikit-learns-vectorizers-9ef26a7170af. [Accessed: 30-Apr-2019].