
Received 4 March 2023, accepted 20 March 2023, date of publication 22 March 2023, date of current version 28 March 2023.

Digital Object Identifier 10.1109/ACCESS.2023.3260763

Big Data ML-Based Fake News Detection Using Distributed Learning
ALAA ALTHENEYAN AND ASEEL ALHADLAQ
Department of Computer Science and Engineering, College of Applied Studies and Community Services, King Saud University, Riyadh 11495, Saudi Arabia
Corresponding author: Alaa Altheneyan ([email protected])
This work was supported by Researchers Supporting Project number (RSPD2023R532), King Saud University, Riyadh, Saudi Arabia.

ABSTRACT Users rely heavily on social media to consume and share news, facilitating the mass dissemination of genuine and fake stories. The proliferation of misinformation on social media platforms has serious consequences for society. The inability to differentiate between the various forms of false news on Twitter is a major obstacle to effective fake news detection. Researchers have made progress toward a solution by emphasizing methods for identifying fake news. This study uses the FNC-1 dataset, which includes four stance categories for identifying false news. State-of-the-art methods for spotting fake news are evaluated and compared using big data technology (Spark) and machine learning. The methodology of this study employed a distributed Spark cluster to train a stacked ensemble model. Following feature extraction using N-grams, Hashing TF-IDF, and a count vectorizer, we applied the proposed stacked ensemble classification model. The results show that the proposed model achieves a superior classification performance of 92.45% F1 score compared to the 83.10% F1 score of the baseline approach, an improvement of 9.35% in F1 score over state-of-the-art techniques.

INDEX TERMS Big data, machine learning, fake news, ensemble learning, social media.

I. INTRODUCTION

The use of social media platforms to disseminate and digest media has increased in recent years. Social networking sites like Facebook and Twitter generate vast amounts of data daily [1]. It is no secret that the internet is a goldmine of information, especially recent news [2]. The proliferation of fake news is directly attributable to the internet's user-friendly nature. Since fake news is often presented as factual, it is often shared on social media. Often, this data is spread for profit or to influence politics. The effects of fake news on society as a whole are profound, and in light of these impacts, fixing this issue is crucial [3]. Multiple instances of false news were reported to have spread on social media during the 2016 US elections, including the presidential election and the nomination of a new Air Marshal in India [4]. The dissemination of false information has negatively affected people's mental health and society as a whole [5].

Many automatically assume that the news is either bogus or legitimate based on the article's content. Techniques based on news content use methods for collecting data and tone from fake news stories. The goal of style-based methods for detecting false news is to exploit the manipulators' writing styles. By examining certain language features, we can distinguish fake news from the real thing [3]. However, false news is created with the intent of fooling readers; thus, improving the detection of false news using news content style is a difficult problem. To help avoid the difficult and time-consuming human work of fact-checking, the Natural Language Processing (NLP) community has shown considerable interest in the automatic recognition of fake news [6], [7]. Determining the integrity of news is a difficult task, even for automated approaches [8]. Becoming familiar with what other news outlets say on the same issue might be a useful starting point for recognizing false news. Identifying a person's position is the purpose of this phase. Multiple tasks, such as evaluating online arguments [9], [10], verifying the integrity of Twitter rumors [11], [12],

The associate editor coordinating the review of this manuscript and approving it for publication was Chong Leong Gan.

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
VOLUME 11, 2023 29447

FIGURE 1. Overview of the headline and text bodies with their respective stances.

or understanding the argumentation structure of seminal works [13], [14], have traditionally relied on position identification.

The first Fake News Challenge (FNC-1) was organized to foster the creation of automated fake news detection systems using AI technology and machine learning. Almost fifty groups from industry and academia worked on this problem. One of the objectives of the FNC-1 challenge is to track down media coverage dealing with a certain title: it might support, challenge, or have nothing to do with the title. There are four potential stances from which an article may treat the title. The guidelines, dataset, and grading criteria for the FNC-1 challenge are all available on its site. These stances are further shown in Figure 1, which depicts four distinct examples.

Multiple deep learning models and Recurrent Neural Networks (RNN), as well as their modifications, including Convolutional Neural Networks (CNN) [15], are often employed for NLP tasks and have been shown to perform very well on NLP-related tasks [16], [17], [18].

A. OVERVIEW OF FAKE NEWS DETECTION
In 2017, Facebook released a white paper that explored the risks of online communication and the challenges of managing one of the most prominent social media platforms today. Weedon, Nuland, and Stamos also noticed the growing challenge of using the enigmatic phrase "fake news," and proclaimed that "the overuse and misapplication of the term 'fake news' might be challenging since we cannot understand or adequately address these concerns without shared definitions" [19]. The word can apply to anything from virtually incorrect news articles to deceptions, April Fools' jokes, rumors, clickbait, or stated opinions posted online with incorrect facts.

In this research work, "fake news" is defined as a written article that is manifestly untrue and falsely disseminated without being authentic, mostly accompanied by malicious intent. This definition includes three important bases: textual, visual, and audio. Elements such as video-based and audio-based fake news are typically ignored when referring to textual fake news; additionally, each element has its own linguistic complexities that necessitate different machine learning and deep learning algorithms to detect and solve problems such as 'Deep Fake,' etc. The notion also implies that fake news might be fact-checked, an important characteristic: the claims may be checked to see if they are true or false. Because rumors are usually hard to verify, they are excluded from the definition. Conspiracy theories are classed as rumors because they are persistent rumors that are difficult to refute. False information concerning the entertainment sector, including hoaxes and April Fools' gags, is also excluded, because the objective must be harmful. Furthermore, the intent is malicious, as it seeks to affect public opinion in favor of a specific message. The definition also excludes text bits that were mistakenly published incorrectly, such as transposed numbers.

A model of the connection between headlines and news content is necessary for identifying clickbait. It is also crucial to tell the difference between false news and clickbait. The term "clickbait" refers to articles with enticing headlines written to attract an online audience or traffic; when people click on such a headline, they end up at a different website with poorly written articles that have nothing to do with the subject line. So, clickbait is written with one goal: getting more people to visit a website that relies on advertising to make


money. The motive is monetary gain rather than furthering a political agenda via disseminating false information.

A prominent example is the deliberate spread of false news about Hillary Clinton by Russian trolls during the 2016 presidential election campaign, which was designed to shift people's voting choices away from Hillary and toward Donald Trump. This instance demonstrates how dangerous it can be when false information spreads on critical issues. Of course, there is another problem with false news: toxic information is spread for no reason other than to sow doubt, stir up chaos, and make it difficult for readers to tell fact from fiction.

1) SOCIAL MEDIA AND FAKE NEWS
Global knowledge dissemination has been democratized because of technological advancements and the emergence of social media. Important news organizations have invested heavily in digital journalism, generating content for media platforms and growing their reach via social media and online tools. Furthermore, online social media platforms are becoming the most important sites for information spreading. Dissemination of information allows for the exchange of ideas and the connectivity of previously inaccessible locations. It enables users to form opinions about the information platforms offer from many perspectives.

In the past, media companies have invested heavily in creating their presence online, with online media networking sites playing a significant role. They use social media platforms such as Facebook and Twitter to promote their material, spread information/news, and develop a network of individuals they may engage with. On the other hand, users benefit from social media's technical developments, since people now have access to a wide range of information sources.

The current digital landscape for information dissemination and the challenges that media organizations face in an ever-present media environment have resulted in substantial changes in how news organizations operate. Economic, technical, and social pressures have combined with the desire to be always visible, to report with ever greater speed and excitement, and to gain followers, creating an atmosphere where fake news is prevalent.

The latest technological advancements in social media have undoubtedly provided a fertile environment for spreading online lies in a primarily deregulated media landscape financed and driven by advertising. The motivation for good is usually overshadowed by the desire for profit, which significantly influences how the medium changes over time. Accordingly, fake news exists on social media alongside real news, and the difficulty appears to be distinguishing them. While fake news is not a new phenomenon, the speed at which it travels, the worldwide reach of the instruments that can distribute it, and the quantity in which it is distributed are unprecedented: social media platforms such as Twitter, Facebook, and Instagram provide an ideal ground for quickly transmitting fake news. Furthermore, bots are increasingly being utilized to distort information, disrupt social media conversations, and draw users' attention, according to the same author.

2) USERS' RE-SHARING BEHAVIOR AND FAKE NEWS
From the perspectives discussed so far, it can be deduced that social media sites play an essential role in disseminating false information. Furthermore, internet users themselves contribute to spreading false information. There are two main types of data sharing on online sharing sites: self-disclosure, in which a user voluntarily discloses private information, and re-sharing, in which a user distributes material already created by another user of the site or a third party. Distributing low-quality, erroneous, or purposefully misleading material may have negative implications, such as spreading false news, but spreading high-quality information can assist in the development of a more informed community. One of the most common ways information is disseminated online is by re-sharing, which includes retweeting, re-posting, re-vining, and re-blogging. In social media, for instance, it is common practice for users to write articles, distribute them among their networks, and engage in related online discourse. Social media users may engage in this practice with various apps. Sharing information rapidly is essential in many situations, including political campaigns and times of crisis, and therefore sites like Twitter, YouTube, and Facebook have become more important. Individuals are also using social media accounts for news production and dissemination.

In the case of social media, for instance, someone may spread false information (or even create a fake tale and post it). Resharing is a feature of many social media sites, so if one person shares a story, it increases the likelihood that others will do the same. Several remedies have been proposed, but there is still much disagreement over what constitutes "fake news", how it spreads, and how it affects social and political outcomes. Multiple major actors, including social media platforms, users, and groups against the spread of fake news, may be able to control the spread of false information on the internet. This brief theoretical overview of the Uses and Gratifications Theory (UGT), the filter bubble phenomenon, and social media re-sharing behavior provides important context for the current investigation. According to UGT research, the Ellinika-Hoaxes Facebook demographic represents an engaged audience searching for high-quality news and information from sources outside their echo chamber via media consumption. This demographic is engaged, actively looking for information and trying to confirm the integrity of rumours they may have seen on social media. Users' familiarity with the Internet, social media, and other media is crucial for identifying the prevalence of false news on these platforms and stopping its spread. To properly answer the research question (RQ) and draw conclusions on how members of the Ellinika-Hoaxes Facebook group use particular media to


FIGURE 2. Category of fake news on social media.

detect and prevent the spread of false news, it is necessary to conduct research into their online behavior.

B. FAKE NEWS CHARACTERIZATION
The principle of fake news has two components: authenticity and purpose. The word "authenticity" refers to the fact that misleading news often contains false information that may be demonstrated to be untrue. Conspiracy theories, for example, are not included in the definition of fake news, since it is nearly impossible to tell whether they are real or false in most situations. According to the second component, the erroneous material's objective is to deceive the reader. Figure 2 represents the categories of fake news on social media. The characterization module represents fake news belonging to traditional media and social media. The second module shows the fake news detection techniques used for both traditional and social media.

First, to identify fake news, one must understand the text context and the procedure to categorize it. It is vital to begin with characterization when developing detection models, and it is also necessary to grasp what fake news is before attempting to identify it. It is also not easy to develop a universally agreed definition of "fake news: stories that are purposely and verifiably misleading and mislead readers". As per Wikipedia, deliberate misinformation or hoaxes spread via multiple online platforms and news channels or digital social media constitute a sort of fake journalism or propaganda [20]. Today's fake news is manipulative and diversified in topics, techniques, and platforms. It consists of two components: authenticity and intent. Fake news material that contains inaccuracies that may be verified falls under authenticity. However, it excludes conspiracy theories, because they are difficult to prove true or false in most circumstances. The second part refers to the misleading material written to deceive the reader.

C. TRADITIONAL MEDIA FAKE NEWS
The media ecosystem supporting the spread of false information has grown and evolved throughout time, including print, broadcast, social media, and digital platforms. Even before the rise of social media, this was seen as a concern because of its role in disseminating false information. Multiple psychological and social scientific foundations are used to characterize the effects of false news on individuals and the social knowledge environment. Humans are not great at distinguishing believable stories from those that are not. Several psychological and perceptual theories explain this phenomenon and the impact of misleading information. Traditional false news exploits readers' emotional vulnerabilities. Incorrect information is more likely to mislead consumers due to the following two major factors:
• Consumers with naive realism believe that their view of the world is valid and that others who disagree with them are irrational or dishonest [21].
• People are more likely to be presented with data that backs their existing worldview. The cognitive biases that are part of the human condition lead consumers to regularly confuse fake news with the genuine thing [22].
By analysing the news ecosystem as a whole, we may be able to pinpoint some of the societal factors that fuel the spread of disinformation. Theories of Social Identity [23] and Normative Influence [5] argue that the need for others' approval is central to a person's sense of self and identity, which increases the likelihood that users will prefer the anonymity and security of online platforms when obtaining and sharing news content, even if it is false.

D. THE EXTRACTION OF FEATURES
Unlike social media, where additional social data may help identify false news, conventional news organizations rely on content like text and photographs to spot and identify fake news. Some representative features of false news are shown in Figure 3. We will next examine how to extract relevant data from the media.

1) TEXTUAL CONTEXT BASED
Three important components make up news content:


FIGURE 3. Feature representation of fake news.

• Source: where the news comes from, who published it, and whether the source is authentic or not.
• Headline: a summary of the news designed to entice readers.
• Body Text: the actual story/content of the news.
The most common method for detecting false information is to look at the content of the news piece. The substance of a news report is generally separated into two types: textual and visual. Much of the news material is presented in the textual mode. As previously said, fake news aims to manipulate the audience, and it does so via the use of specific terminology. Non-fake news, however, is usually associated with a different language profile, since it is more legitimate. Attribute-based language characteristics and structure-related language features are two common categories.

2) ATTRIBUTE-BASED LANGUAGE FEATURES
These involve the ten parallel aspects of content style's linguistic elements. These aspects include volume, uncertainty, objectivity, emotions, diversity, and readability [24]. Although attribute-based language characteristics are generally very important, explainable, and predictable, they are often less useful for assessing deception style compared to structure-based features. Furthermore, attribute-based features require extra resources for deception detection, which may take longer and demands significant focus on correct feature evaluation and filtering.

3) STRUCTURE-BASED LANGUAGE FEATURES
Content style is defined by structure-based linguistic properties and covers four levels of language: the first is the lexicon, the second is semantics, then discourse and syntax. Structure-related features are also technique-oriented features, because most quantification depends on NLP-based methods. The critical challenge at the lexical level is identifying the frequency statistics of words, letters, or other entities, which may be done correctly by applying n-gram models. Part-of-Speech (POS) taggers execute shallow syntax tasks at the syntax level, making tagging and assessment of POS easier. Probabilistic Context-Free Grammars (PCFG) analyse Context-Free Grammars (CFG) by performing deep syntax-level operations with parse trees. On the semantic level, word count (WC) and linguistic inquiry are also utilized to create semantic classes for semantic features.

E. PROBLEM FORMULATION
Developing a Spark distributed cluster-based environment for efficiently detecting fake news articles via a supervised learning paradigm necessitated solving two sub-problems. First, our model needed to learn how to recognize and capture the necessary information in lengthy textual news articles for categorizing the association between news item titles and related meta descriptions.

F. RESEARCH OBJECTIVES
In the first section of this research, we examine the effectiveness of Recurrent Neural Networks (RNN) in modeling news articles to identify the link between an article's body content and its title. As part of our research, we use the dataset made available for the FNC-1 competition to train and assess a classifier. We want the classifier to be able to do the following.

FIGURE 4. Graphical representation of proposed approach.

• Use the Spark framework to research, assess, and compare several machine learning classification techniques on four classes from the FNC-1 dataset.
• Given a title and an article, determine if the article agrees with, disagrees with, discusses, or is irrelevant to the assertion made in the headline.
• Propose an efficient, systematic, and functional approach based on machine learning algorithms for detecting fake news using Spark, and design an efficient stacked ensemble classifier for fake news detection.
In an experiment, we demonstrate that the recommended method can accurately identify fake news and outperforms current state-of-the-art algorithms.

G. PAPER LAYOUT
The remaining paper contains the following sections. Related work is reviewed in Section II. The dataset used for


experimentation and preliminaries is discussed in Section III. The experimental results and discussion are articulated in Section IV. Finally, Section V presents the conclusion and future work.

II. LITERATURE REVIEW
This section provides an overview of previous research's difficulties in identifying fake news. To identify fabricated news stories, it is necessary to perform rumor detection and identification. It is important to distinguish between real and fake news, since fake news is based on deliberate fabrication. Fake news identification is particularly difficult when detecting news based on characteristics. Tweets and social context can be used to generate features. As a result, we assess prior work based on single-modality and stance identification.

A. TEXTUAL CONTENT BASED
Most earlier news identification studies relied mainly on textual elements and user metadata. Text-based features are statistically extracted from message text content and have been extensively discussed in the literature on fake news identification. The textual component extracts unique writing styles [15], [19], [20] and emotional sensations [18] that are prominent in fake news.

Network connections, style analysis, and individual emotions have all been proven to contribute to detecting fake news [19]. After reading these posts, [20] explored the writing style and its effects on readers' viewpoints and attitudes. Emotion is a significant predictor in many fake news detection studies, and most rely on user positions or simple statistical emotional features to convey emotion. In [15] the authors introduced a novel dual emotion-based method for identifying fake news that can learn from publishers' and users' content, user comments, and emotional representation. Reference [25] employed an ML model for identifying fake news that uses convolution filters to distinguish between different granularities of text information. They investigated the issue of stance categorization in an innovative approach to consumer health information inquiries and achieved 84% accuracy using the SVM model.

B. SOCIAL CONTEXT BASED
User-generated social media interactions with news stories may give additional information, in addition to aspects directly relevant to the substance of the stories. In [26] the authors proposed a novel approach employing a knowledge graph to identify fake news based on actual content. A graph-kernel-based approach was used by [27] to discover propagation patterns and attitudes. On the other hand, social context features are difficult to gather, because they are noisy, unstructured, and time-consuming to collect [28].

C. STANCE DETECTION OVERVIEW
From a broad viewpoint, stance detection can be elaborated as the problem of determining an author's or text's point of view concerning a specified target, such as a single topic, headline, or even a person [15], [29]. Consequently, there are three factors, and a machine-learning-based categorization technique determines how the comparison occurs. The group's labels (for example: help, against, for, or neutral) are determined by the issue. Political arguments [30], [31], articles [32], [33], and even internal company dialogues [25], [34], which stretch across a wide range of fields, may be referred to as categories. Detecting the stance of Tweets or short texts such as hearsay [35] or microblogging accounts has gotten much attention in opinion mining. "Hillary Clinton" as a celebrity, "Atheism" as a specific issue, or the claim that "E-cigarettes are safer than regular cigarettes" are examples of targets presented in the available datasets. Shared tasks for providing such datasets and promoting research have emerged in several languages.

The sub-task for exposing stance in Tweets [26] was presented at SemEval-2016, with roughly 5,000 tweets in English, covering five familiar subjects. The task has initiated a variety of approaches, including conventional techniques (for example, KNN [36], SVM [22], or essential attributes given by methods [34]) and deep learning approaches (e.g., BiLSTM [37], Bidirectional Conditional Encoding [27], [34]). Furthermore, there are public datasets, for instance, the Multi-Perspective Consumer Health Query dataset [38], dedicated to exposing the stance of sentences taken from high-quality articles on five separate assertions, like "Sun exposure causes skin cancer"; the dataset is available for the development of new work. It contains an in-depth examination of various approaches to the two goals listed above. The need for well-interpreted data in languages other than English has rapidly increased annotation efforts and collaborative tasks aimed at furthering research. There are efforts like Stance-Cat, which aims at identifying attitudes in Spanish and Catalan tweets [39], a proposal and database of brief statements in Russian online forums [40], and even projects that integrate several languages [41].

A group of volunteers from industry and academia launched the Fake News Challenge in December 2016 [10]. Using Machine Learning, Natural Language Processing (NLP), and Artificial Intelligence (AI), this competition aimed to encourage the development of technologies that could assist human fact-checkers in detecting deliberate deception in news reporting. As a first step, the organizers decided to research what other media outlets have to say about the topic. Consequently, they decided to introduce the event with a stance detection challenge in the first round of competition. The organizers collected data on headlines and body text before the event. In the competition, they asked participants to create classifiers that could reliably classify a body text's viewpoint on a given headline into one of four categories: "disagree", "agree", "discuss" or "unrelated". On this task's test set, the top three teams achieved accuracy rates greater than or equal to 80%. The top team's model combined Gradient Boosted Decision Trees and Deep Convolutional Neural Networks.
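The hand-coded features behind the strongest simple baselines for this task can be made concrete. A classic example is lexical overlap between a headline and a body text, which on its own separates many "unrelated" pairs from the three related stances. The helper names and the 0.05 threshold below are illustrative assumptions for a minimal sketch, not the official FNC-1 baseline code.

```python
def jaccard_overlap(headline, body):
    """Jaccard similarity between the headline's and body's word sets."""
    h, b = set(headline.lower().split()), set(body.lower().split())
    return len(h & b) / len(h | b) if h | b else 0.0

def related_or_unrelated(headline, body, threshold=0.05):
    # Pairs sharing almost no vocabulary are very likely "unrelated";
    # the remaining pairs still need a trained classifier to separate
    # "agree", "disagree", and "discuss".
    return "unrelated" if jaccard_overlap(headline, body) < threshold else "related"

print(related_or_unrelated("Apple launches new phone",
                           "Bananas are yellow fruit"))  # unrelated
```

In a full pipeline, features like this would be concatenated with n-gram and TF-IDF similarities and fed to a classifier such as the Gradient Boosting baseline published by the organizers.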


D. MISLEADING HEADLINES
Identifying misleading headlines in this research required classifying each article's treatment of the assertion made in the title into one of four categories: (a) agrees, (b) discusses, (c) disagrees, and (d) irrelevant (the headline and body text discuss different topics). As a result of the proliferation of annotated corpora and the increased use of new technologies to combat the fake news pandemic, a new obstacle has recently presented itself to the field of fake news analysis [8]. In this setting, several research challenges and competitions have been presented; the most recent and important ones are dissected in detail here. The Emergent dataset [18] was used to create the Fake News Challenge (FNC-1) [42]. The goal of FNC-1 is to serve as a benchmark for research into AI-based technologies, machine learning, and natural language processing as they apply to the detection of false news. The organizers decided to begin with stance detection as the first step of this macro-challenge. The FNC-1 dataset, which includes over 75,000 instances labelled as either "agreeing," "discussing," "disagreeing," or "unrelated," was made publicly available. Given the headline "Robert Plant Ripped up $800M Led Zeppelin Reunion Contract," excerpts annotated in the FNC-1 dataset illustrate the categories mentioned. Body content that conforms to the headline is an instance of the agree class. The discuss class covers cases where the article's main body addresses the same issue as the title but does not take a position on the matter. When the headline and body content address different topics, the pair belongs to the unrelated class. The FNC-1 competition had 200 entries, the top 10% of which averaged 82% relative points. The organizers developed a baseline using just hand-coded features and a Gradient Boosting Classifier, both freely accessible on GitHub. Top systems were UCLMR [43], Talos [44], and the Athene system [23]. The CNNs utilised by Talos [44] were one-dimensional, active at the word level, and trained using Google News topic vectors for the article's main body and title. The data

news title and article content, outside the FNC-1 Challenge and dataset. Several writers have compiled claims and criticisms [21], [47] to help with identification. Some analytic effort is devoted to "argument mining," in which the headline presents an argument not supported by the content. While argument mining is effective in solving the problem of stance identification, other tasks that discover semantic relationships within the text, such as inconsistency detection [48], contrast detection [49], and synthesis detection [50], may also be useful. Mishra et al. provided a comprehensive taxonomy for spotting false news, outlining the many forms of disinformation and what sets them apart. Multiple mechanisms exist to track down those who propagate false information. Multiple datasets, including LIAR and other false news corpora, have been used to compare traditional machine and deep learning techniques. This study demonstrated that deep learning methods outperformed more conventional machine learning strategies. Bi-LSTM outperforms the competition in detecting bogus news with an F1 score of 96.

In [43] the authors introduced the Multi-integrated Domain Adaptive Supervision (MIDAS) system to automatically choose the model that best fits a particular collection of data drawn from random distributions. By using local smoothness as a proxy for accuracy and the relevance of training data, MIDAS can increase generalization accuracy across nine distinct fake news datasets. MIDAS has a greater than 10% success rate in recognizing bogus news linked to COVID-19, compared to other labelling methods [43]. The results of the literature review are summarized in Table 1.

III. PROPOSED METHODOLOGY
This section describes the proposed approach in comprehensive detail. The proposed approach comprises multiple steps: data analysis, feature extraction, single-classifier classification, and ensemble-classifier classification, as shown in Figure 4. In stage 1, a particular purpose and dataset are presented to handle the difficulty of identifying
from the CNN is then fed into a multi-layer perceptron fake news. The challenge’s primary motivation is to build
(MLP) model that generates one of four possible classes of a semi-automated pipeline that examines the attitude of
results. Next, it undergoes a comprehensive, start-to-finish several news items on a specific topic. Thus, the dataset
training process. The system won the FNC-1 competition comprises occurrences with a title, article body, and one
with its superior performance using the CNN-MLP combo. of the four labels ‘‘Disagree’’, ‘‘Agree’’, ‘‘Unrelated’’, and
In recent trials, several research have employed FNC-1 with ‘‘Discuss’’. Figure 4 summarizes our proposed approach,
encouraging outcomes. For instance, [45] suggested a tree- which consists of the steps to achieve fake news classification
like structure for the linked classes by combining the existing by solving multi-class labels. The first part explains the
disagree, agree, and discuss ones. This approach uses a two- corpus creation technique by combining stances and bodies
layer neural network to learn a hierarchical representation of based on news article ids. The second phase describes the
classes, achieving a weighted accuracy of 88.0%. preprocessing processes done on news article text. The
Additionally, scholars built a stance detection model third phase demonstrates techniques to feature selection or
using accomplishment transfer learning on a Roberta Deep dimensionality reduction. The fourth stage describes each
Bidirectional Transformer Language Model. They achieved ML and ensemble model used in this study. Finally, the last
a weighted accuracy of 90.01% by employing Bidirectional phase outlines this study’s various ensemble learning models.
Cross Attention between claim article pairings via pair We divide the dataset into two parts for experiments: training
encoding with self-attention [46]. Further work should be and testing. The training dataset comprises 75% of the data,
done on posture identification problems, such as linking a whereas the testing dataset contains 25%.
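A 75/25 split like the one described above can be produced with Spark's DataFrame.randomSplit. The plain-Python sketch below mimics its per-row behaviour (this is our own illustrative helper, not the authors' code; note that, as in Spark, each row is assigned independently at random, so the resulting sizes are approximate rather than exact):

```python
import random

def random_split(rows, weights, seed=42):
    """Assign each row independently to a partition with probability proportional
    to its weight, mimicking Spark's DataFrame.randomSplit (sizes are approximate)."""
    total = sum(weights)
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)          # cumulative probability boundaries, ending at 1.0
    rng = random.Random(seed)
    parts = [[] for _ in weights]
    for row in rows:
        r = rng.random()
        for i, bound in enumerate(bounds):
            if r <= bound:          # first boundary that covers the draw
                parts[i].append(row)
                break
    return parts

train, test = random_split(list(range(10_000)), weights=[0.75, 0.25])
print(len(train), len(test))  # roughly 7,500 / 2,500
```

Because the assignment is per row, re-running with the same seed reproduces the same split, which is the property the experiments below rely on.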

29454 VOLUME 11, 2023


A. Altheneyan, A. Alhadlaq: Big Data ML-Based Fake News Detection Using Distributed Learning

TABLE 1. Literature review summary.

A. DATASET
Carnegie Mellon University adjunct professor Dean Pomerleau and Delip Rao, founder of Joostware and the AI Research Corporation, hosted a competition called the Fake News Challenge Stage 1 (FNC-1) to investigate the potential of machine learning and natural language processing in the fight against fake news [27]. This issue was the driving force for the competition, which focused on stance detection. This section


provides an overview of the competition dataset, the baseline used by the FNC-1 organisers, and the winning strategies used throughout the competition.
The dataset was built by pairing each news story with a headline and annotating the stance of the story toward the claim introduced in that headline. For the original stance annotation exercise there were three possible labels: ''for,'' ''against,'' and ''observing.'' The Emergent dataset [27] is the basis for the FNC-1 competition dataset. To create the FNC-1 dataset, headlines and articles from the Emergent dataset are randomly matched depending on their stance toward the linked claim. First, the headline-article pairs are separated into related and unrelated groups. Second, and more difficult, the collection of related headline-article pairs is further split into the three classes disagree, agree, and discuss, allowing supervision of the task of evaluating the stance of an article relative to the claim presented in the associated headline. There are 49,972 headline-article pairs in the training set of the FNC-1 dataset, and another set of pairs in the test set. There are 1,689 distinct headlines and 1,648 unique articles used to build the headline-article pairs that make up the training set. The test set includes 904 distinct articles and 894 unique headlines. In the training set, 73 percent of pairs are classified as unrelated, 7.4 percent as agree, 1.7 percent as disagree, and 17.8 percent as discuss. About 72.2 percent of the test data is unrelated, 7.4 percent agree, 2.7 percent disagree, and 17.6 percent discuss. The training set has 40,350 headline-article pairs, the hold-out set has 9,622, and the test set has 25,413 pairs.

B. CORPUS DESIGN
The FNC-1 dataset has four distinct classes (agree, disagree, discuss, unrelated). In pre-processing, labels are encoded into numeric target values and several pre-processing steps are performed. The preprocessed data is split into 75% for training and 25% for testing.
This study used the FNC-1 dataset, consisting of two CSV files containing the stances and the bodies of text news stories written in English. Collecting news stories from multiple sources is difficult due to a lack of linguistic resources. Furthermore, annotating these news pieces based on their contents necessitates specialist expertise, a significant amount of time, and substantial money. As a result, augmented corpus design is the only practical way to conduct fake news detection research. Our augmented corpus is created by combining 49,972 stances with 1,683 bodies based on ids. The corpus has four distinct classes (agree, disagree, discuss, unrelated). It contains 8,909 discuss stances, 36,545 unrelated stances, 3,678 agree stances, and 840 disagree stances. After gathering headlines and articles in one column, the final corpus contains the text and the stances.

C. PRE-PROCESSING
Data mining relies heavily on pre-processing. It converts inconsistent and incomplete raw data into a machine-readable representation. Various text preprocessing activities were conducted on the FNC-1 dataset. To complete these tasks, NLP techniques such as conversion to lowercase, stop word elimination, stemming, and tokenization, as well as algorithms from the Keras library, were used. Stop words, which comprise words like ''the'', ''of'', and ''there'', are the most commonly used words in our daily language and typically carry relatively little significance for the overall context of a phrase. By removing the stop words, we save the time and space that would otherwise be consumed by these uninformative tokens. Words with comparable meanings may appear in the text many times; for example, ''eating'' in any sentence will become ''eat''. Reducing each word to its most basic form helps in such cases. This operation, known as stemming [51], uses an open-source version of the NLTK's Porter stemmer. The main preprocessing steps are as follows:
1) Stop Word Removal: Languages commonly use a group of terms collectively known as ''stop words.'' The words ''a,'' ''the,'' ''is,'' and ''are'' are all examples of stop words in English. Stop word removal is common in text mining and natural language processing (NLP) to weed out overused words that contain little useful information. NLTK provides the stop word dictionary


in this instance. To begin, the text is cleaned by removing all stop words. Stop words can be removed because they are frequent and carry little useful information; common examples include the conjunctions ''and'', ''or'', and ''but''. This step matters in natural language processing because processing these frequent but uninformative words consumes a significant amount of time.
2) Punctuation Removal: Punctuation provides the grammatical context of a sentence. A comma, for example, may not add anything to the understanding of the statement.
3) Link Removal: This step removes hypertext links from social media posts. Regular expressions are used to do this.
4) Lemmatization or Stemming: Either lemmatization or stemming is done during this step. The NLTK's WordNet lemmatizer is used for lemmatization, while the NLTK's Snowball stemmer implementation, based on the Porter2 stemming algorithm [52], is used for stemming.
5) Reply Removal: Apart from the above-mentioned stages, every social media post must also go through reply removal: words beginning with @ (primarily used for Twitter replies) are eliminated in this phase, again using regular expressions.
6) Lowercase Transformation: Every word is converted to lowercase in this phase to remove variation due to capitalization.

D. FEATURE EXTRACTION
Feature extraction transforms raw data into numerical features that can be processed further while preserving the information in the original data set. It is more effective than training a machine on raw data directly.

1) HASHINGTF
HashingTF transforms a set of terms into a feature vector of fixed length; in text processing, a ''term set'' is typically a collection of words. It employs the hashing trick: a hash function maps each raw feature (term) to an index, and the mapped indices are then used to calculate the term frequencies. The hash function used is Austin Appleby's MurmurHash3 (MurmurHash3_x86_32). Because the hash value is translated to a column index using a simple modulo, the features will not be mapped evenly to columns unless the numFeatures parameter is a power of two. When working with large datasets, avoiding the creation of a global term-to-index map is preferable, because building one can be time-consuming and expensive. However, this method is vulnerable to hash collisions, which occur when different raw features are hashed to the same index; increasing the number of buckets in the hash table is recommended to reduce the likelihood of collisions. There is also a binary toggle parameter that controls how term frequencies are counted: when it is set to true, all nonzero frequency counts are reset to 1, which suits discrete probability models that use binary rather than integer counts.

2) IDF
Inverse Document Frequency (IDF) is a calculation frequently employed in association with term frequency. The issue with term frequency alone is that frequent terms are not necessarily the most significant; for example, ''content'' will appear on almost every web page. IDF is a method for lowering the weight of frequently occurring words in a corpus (collection of documents). IDF is determined from the total number of documents divided by the number of documents containing the term. In Spark, IDF is an Estimator that produces an IDF model after being fitted to a dataset; the model then rescales feature vectors (typically created by HashingTF or a count vectorizer), downweighting features that are common across the corpus [46].

E. CLASSIFICATION MODELS AND PARAMETER SETTINGS
We use the following machine learning techniques to detect irregularities, break down unusual events, and investigate the effectiveness of our advanced method:
Random Forest (RF): a supervised learning technique that may be used for classification, regression, and other tasks. It generates a number of trees to aid in decision-making: it takes random samples of the data, constructs many decision trees, obtains a forecast from each tree, and then votes on the best option. The parameters for our RF method are n-estimators = 200, bootstrap = True, criterion = gini, min-samples-split = 2, random-state = 0, and min-samples-leaf = 1.
Logistic Regression (LR): a supervised learning model for discrete targets. It is a very straightforward ML algorithm for problems such as noise detection, diabetes prediction, cancer detection, etc. LR is used to predict the probability of the target variable [47]. In our application, the parameters of the LR algorithm are penalty = l2, C = 1.0, solver = lbfgs, max-iter = 100, and verbose = 0.
Decision Tree (DT): decision trees are extensively used in decision analysis and machine learning [21]. A DT is a decision-making tool that uses a tree-like graph of decisions and consequences, such as random event outcomes, resource costs, and utility, to make judgments. Internal nodes in a DT express a


condition about an attribute. Each internal node divides into branches depending on the condition's outcome, until a point is reached where it no longer splits and leads to leaf nodes, which indicate the class label that will be applied [48].
Ensemble Classifier: In addition to the individual classifiers, an ensemble technique was developed that combines the three classifiers. The objective is to develop a voting classifier that calculates the weights to apply to each classifier's prediction [53]. The probabilities computed by the classifiers are first stored in a matrix for each training instance, so that each training case is linked with a probability vector. The weights are calculated from this matrix of vectors, which is then fed into a meta-classifier that produces the final label (0, 1, 2, or 3).
In contrast to the ensemble model, a voting classifier was also constructed to perform simple majority voting among the models' predictions. Ensemble classification is generally divided into two stages: base level and ensemble level. The base predictors take the HashingTF-IDF features extracted from the news articles as input. The output predictions from these base predictors are fed into the ensemble-level model. The ensemble model's main purpose is to improve the overall prediction F1 score by overcoming the shortcomings of the primary predictors. We have used stacking ensemble models for ensemble classification [54].

F. EVALUATION METRICS
The main concern is determining the model's ability to discern true and false news. We used several metrics to properly examine the model's efficiency on this difficult challenge. Model selection and implementation are essential but should not take precedence over the rest of the project. Various assessment measures are applied to the test data to assess the model's capacity to detect false news. Multiple evaluation metrics, such as the classification report (accuracy, precision, recall, F1-score) and the confusion matrix, may be used to assess machine learning models. The following paragraphs go through each of the assessment measures in detail. The pre-processed fake news data is fed into a strong algorithm, producing impressive results [49].
Observations that match the model's predictions are true positives and true negatives, respectively; false positives and false negatives are the two kinds of error we would like to minimize. Each of these terms can be defined precisely.
A True Positive (TP) is a correctly predicted positive result, where the actual and predicted class values are both yes: for instance, both the expected and actual class values suggest that the passenger survived, and indeed they did. When both the actual and predicted class values are negative, the value is a True Negative (TN): for instance, both the actual and predicted classes suggest that the passenger did not survive, and indeed they did not. False positives and false negatives occur when the actual class differs from the predicted class. When the predicted class is yes but the real class is no, this is called a False Positive (FP): for example, the actual class shows that the passenger did not survive, but the predicted class says that they did. When the true class is yes but the predicted class is no, a False Negative (FN) has occurred: for example, the actual class value reveals that the passenger survived, whereas the predicted class value says that they died.
To verify the usefulness of the model, the following assessment criteria are used:
Accuracy is the proportion of test instances that were predicted correctly, calculated by dividing the number of correct predictions by the total number of predictions:

Acc = (TP + TN) / (TP + TN + FP + FN)

Precision: to calculate a classifier's precision, divide the number of true positive outcomes by the total number of positive predictions:

Pr = TP / (TP + FP)

Recall: the number of true positive outcomes divided by the total number of actual positive instances:

Re = TP / (TP + FN)

F1-score: the harmonic mean of precision and recall, a convenient way to account for both at once:

F1 = 2 x (Precision x Recall) / (Precision + Recall)

The quality of a model's predictions may be summarized with a classification report (CR). The correct and incorrect classifications for each category are used to determine the totals, based on the true positive (TP), true negative (TN), false positive (FP), and false negative (FN) counts. Several metrics may be used to evaluate a model's efficacy, but accuracy is often prioritized. The report incorporates a range of assessment measures, including accuracy, precision, recall, F1 score, and support. Precision indicates how much of what the model predicts as positive is correct, and recall indicates how much of each class was actually retrieved; the F1-score combines the two, and a model with an F1-score of 1 is ideal. ''Support'' refers to the number of occurrences of each class in a given dataset [50], and ''accuracy'' refers to the proportion of correct predictions relative to the total number of predictions.
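The four formulas above can be computed directly from the confusion-matrix counts. The helper below is a small plain-Python illustration (our own sketch, not the evaluation code used in the experiments):

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Example: 80 true positives, 90 true negatives, 10 false positives, 20 false negatives.
m = classification_metrics(tp=80, tn=90, fp=10, fn=20)
print(m)  # accuracy 0.85, precision = 80/90, recall 0.8, F1 about 0.842
```

For the multi-class FNC-1 setting, these per-class values are then combined with macro or weighted averaging, as discussed in the results section.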


TABLE 2. Proposed approach results.
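To make the HashingTF and IDF mechanics described in the feature-extraction subsection concrete before turning to the results, here is a plain-Python toy version (our own sketch: Spark's implementation uses MurmurHash3 and its own Estimator API, whereas this illustration uses CRC32 and hypothetical helper names):

```python
import math
import zlib

def hashing_tf(tokens, num_features=16):
    """Map tokens to term-frequency buckets with the hashing trick (no global vocabulary)."""
    vec = [0] * num_features
    for tok in tokens:
        idx = zlib.crc32(tok.encode("utf-8")) % num_features  # hash value -> column index
        vec[idx] += 1
    return vec

def idf_weights(doc_vectors):
    """Smoothed inverse document frequency per bucket: log((n_docs + 1) / (df + 1))."""
    n_docs = len(doc_vectors)
    num_features = len(doc_vectors[0])
    df = [sum(1 for v in doc_vectors if v[i] > 0) for i in range(num_features)]
    return [math.log((n_docs + 1) / (df[i] + 1)) for i in range(num_features)]

docs = [["fake", "news", "spreads", "fast"],
        ["news", "stance", "detection"],
        ["fake", "stance"]]
tf = [hashing_tf(d) for d in docs]          # fixed-length term-frequency vectors
idf = idf_weights(tf)                       # downweights buckets common to many docs
tfidf = [[t * w for t, w in zip(vec, idf)] for vec in tf]
print(tfidf[0])
```

With num_features a power of two, the modulo spreads hash values evenly across buckets, which is the power-of-two recommendation made in the HashingTF subsection; a bucket hit by every document receives an IDF weight of zero under the smoothed formula.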

IV. EXPERIMENTS AND RESULTS
A. CLASSIFICATION RESULTS
The experimental results of the Term Frequency-Inverse Document Frequency (TF-IDF) and HashingTF feature extraction techniques with the ensemble models are presented in Table 2. The results using HashingTF and IDF features in terms of accuracy, precision, recall, and F1-score are 93.45%, 92.03%, 92.45%, and 92.25%. The accuracy of LR with HashingTF-IDF is 93.45%, the highest among all the experiments. Furthermore, bigram logistic regression exhibits 88.45% accuracy, 87.02% precision, 88.01% recall, and 87.06% F1-score. We also performed experiments using GloVe word embeddings with logistic regression; the GloVe-based results are not as high, with an accuracy of 73.25% and 63.12%, 73.25%, and 62.45% as the precision, recall, and F1-score. To make a broader comparison, we also include features from the count vectorizer technique. The count vectorizer features were passed to logistic regression to detect fake news; with them, the logistic model achieved 88.45% accuracy, 82.12% precision, 88.45% recall, and 87.35% F1-score. Moreover, we merged the count vectorizer and TF-IDF features to obtain better results, but failed to achieve any improvement, at a high computational cost: the accuracy, precision, recall, and F1-score using count vectorizer plus TF-IDF features with logistic regression are 84.54%, 83.12%, 84.25%, and 83.26%. We also tested the Support Vector Machine (SVM) model using count vectorizer features, and the SVM model obtained improved results with 91.75% accuracy, 91.25% precision, 91.24% recall, and 90.45% F1-score; compared to LR with the count vectorizer, the SVM obtained higher results. We also employed the LR and SVM models with HashingTF-IDF features. Here the results of LR are better than those of the SVM model: the SVM with HashingTF-IDF achieved 90.75% accuracy, while the LR model with HashingTF-IDF obtained 93.78% accuracy. Finally, we utilized Trigram, Unigram + Bigram + Trigram, Unigram + Bigram + Trigram + 16,000 limited top features, and Unigram + Bigram + Trigram + CV + IDF + Chi-square features with logistic regression to detect fake news efficiently. LR with trigrams obtains significant results: accuracy is 83.47%, precision is 82.01%, recall is 83.45%, and F1-score is 82.64%. Compared to individual trigram features, the LR model with uni-, bi-, and trigrams obtained better results, with 88.64% accuracy. However, when running tests with uni-, bi-, and trigrams plus the 16,000 limited top features, the LR model obtained a lower accuracy of 83.78%. Ultimately, we merged all the features (Unigram + Bigram + Trigram + CV + IDF + Chi-square), applied LR to them, and obtained promising results with 83.45% accuracy and an 82.45% F1-score.
Figure 5 (a) shows the classification report of the ensemble model. The support column presents the number of instances of each class in the testing set; 12,403 instances are used as testing data. We used weighted averaging to calculate the precision, recall, and F1-score because it deals with the class imbalance problem. The mean of the precision, recall, and F1-score over all classes is calculated using the macro average, while the weighted


average is the total number of TP divided by the entire number of objects in all classes. The weighted average score is higher due to the class imbalance in the dataset. We also construct the ensemble model's confusion matrix, as shown in Figure 5(b). A confusion matrix, also known as an error matrix, is a table that visually depicts the performance of a supervised classification machine learning system. Figure 5(b) shows that the model made multiple incorrect classifications. The ensemble model's final accuracy on the testing data is 93%.

TABLE 3. Comparative analysis of proposed and baseline approaches.

B. PERFORMANCE COMPARISON OF DIFFERENT APPROACHES
The comparative analysis of the proposed approaches with various baseline approaches is presented in Table 3. The bold values mark the highest achieved scores of the proposed and baseline approaches. The experimental setting of the proposed approaches resembled the baselines. Table 3 shows that the proposed approach with TF-IDF features and the LR model outperforms the baseline's highest F1 score of 83.10%, with the proposed approach obtaining the highest F1 score of 93.84%. In addition, for the class-wise scores, the baseline approach of [46] exhibits the best score for the Agree class with 73.76%, whereas the proposed approach with TF-IDF features and the LR model achieved the highest Agree class score of 80.23%. The proposed approach outperforms the baseline regarding the F1 score, with the highest F1 score of 92.45%, an improvement of 9.35%.

C. DISCUSSION
The FNC-1 dataset, which contains 49,972 headline articles and four distinct categories (discuss, agree, unrelated, and disagree), was used to achieve the investigation's objectives and obtain the desired results. The proposed system comprises numerous components, such as data pre-processing, visualization, exploratory analysis, feature extraction, and classification using machine learning strategies. We proposed classifying data with an ensemble model influenced by


machine learning in real time during the experiment. As a direct result, a more rapid interpretation of the findings is now possible. Instead of just one, two, or three separate classification methods, the proposed ensemble model employs three distinct machine learning approaches (Random Forest, Logistic Regression, and Decision Tree). This ensemble model was created as part of our efforts to improve on our previous investigations into identifying and categorizing fake news.
Several different factors influence the current situation. Several experiments were carried out using the Apache Spark framework to handle big data and perform the classification task; these experiments were carried out to improve our ability to detect fake news, and as a result our ability to recognize hoaxes and other forms of disinformation should be enhanced. The model's performance was one of the aspects considered during the evaluation process for this particular piece of research: its accuracy was considered as part of the evaluation, in addition to its performance against five other distinct criteria. The evaluation metrics include accuracy, precision, recall, the F1-score, and the confusion matrix.
PySpark was chosen because it uses RDDs, which significantly accelerate computation. As a result, the computations finished significantly faster than they otherwise would have; this was the essential consideration in deciding whether or not to employ PySpark. Compared to the other approaches utilized during this inquiry and the previous baseline studies, the suggested ensemble model had the greatest F1 score. This model achieved the highest F1 score of 92.45% due to the HashingTF-IDF features that were added during development. We boosted the F1 score by 9.35%, a sufficient gain to demonstrate the novelty of this research.
In the future, one of our long-term goals is to use Spark to implement deep learning models in a multi-agent distributed learning environment. These algorithms will be used to detect instances of fake news, so that we can assess the effectiveness of a wide range of machine learning and deep learning algorithms on a diverse set of fabricated news stories. Furthermore, we intend to create a feature ensemble of different embedding techniques alongside different machine learning and deep learning models capable of accurately recognizing and categorizing various hoaxes and fake news. This will not only aid in understanding the patterns of hoax and fake news detection but also help in developing a cutting-edge real-time fake news detection system.

FIGURE 5. (a): Classification report of final ensemble model. (b): Confusion matrix of final ensemble model.

V. CONCLUSION
The headline stance checker has been shown to be a helpful method for exposing falsehood in the news, particularly when a headline is contrasted with its content body. To demonstrate the applicability of the headline stance checker, various tests were conducted in the context of an existing task (Fake News Challenge FNC-1), in which the stance of a headline had to be categorized into one of the following classes: disagree, agree, unrelated, and discuss. The studies included verifying each of the suggested classification steps separately, and the overall method is evaluated by comparison with the state of the art on this task. In this study, we used the FNC-1 dataset, which categorizes fake news into four categories, while using big data technology (Spark) to perform machine learning analysis for assessment and comparison with other state-of-the-art approaches in fake news identification. The suggested approach created a stacked ensemble model and experimented with it on a distributed Spark cluster. We used N-grams, HashingTF-IDF, and a count vectorizer for feature extraction, followed by the suggested stacked ensemble classification model. Compared to the baseline techniques' results, the suggested model has a high classification performance of 92.45% in F1-score. It outperforms the previous baseline techniques and improves the F1 score significantly, by 9.35%.

A. RECOMMENDATIONS FOR FURTHER WORK
We currently work with a supervised approach, but researchers can work with unsupervised fake news detection in the future. This proposed work can also be extended


using various neural network-based models, which are better suited to unsupervised fake news detection. Spark also took too much training time with the standalone cluster, roughly twice as long to train. In future work, experiments can be performed with a cluster created across different machines, and we will try to build such a cluster on separate computers.
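As a sketch of that direction, Spark's standalone-mode scripts can spread the cluster across machines. The hostname, memory setting, and script name below are placeholders rather than our actual configuration (and on Spark versions before 3.1 the worker script is named start-slave.sh):

```shell
# On the master machine: start the standalone master
# (it listens at spark://<master-host>:7077 by default).
$SPARK_HOME/sbin/start-master.sh

# On each separate worker machine: register with the master.
$SPARK_HOME/sbin/start-worker.sh spark://master-host:7077

# Submit the training job against the distributed cluster
# instead of a single-machine standalone cluster.
$SPARK_HOME/bin/spark-submit \
    --master spark://master-host:7077 \
    --executor-memory 4G \
    train_fake_news.py
```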
ACKNOWLEDGMENT
We would like to thank Researchers Supporting Project number (RSPD2023R532), King Saud University, Riyadh, Saudi Arabia.
multi-perspective consumer health information,’’ in Proc. ACM India Joint
Int. Conf. Data Sci. Manage. Data, Jan. 2018, pp. 273–281.
REFERENCES [26] S. V. Vychegzhanin and E. V. Kotelnikov, ‘‘Stance detection based
[1] P. H. A. Faustini and T. F. Covões, ‘‘Fake news detection in multiple on ensembles of classifiers,’’ Program. Comput. Softw., vol. 45, no. 5,
platforms and languages,’’ Expert Syst. Appl., vol. 158, Nov. 2020, pp. 228–240, Sep. 2019.
Art. no. 113503. [27] C. Silverman, ‘‘Lies, damn lies and viral content,’’ Tow Center Digit.
[2] M. D. Vicario, W. Quattrociocchi, A. Scala, and F. Zollo, ‘‘Polarization Journalism, Columbia Univ., New York, NY, USA, 2015.
and fake news: Early warning of potential misinformation targets,’’ ACM [28] S. Harabagiu, A. Hickl, and F. Lacatusu, ‘‘Negation, contrast and
Trans. Web, vol. 13, no. 2, pp. 1–22, May 2019. contradiction in text processing,’’ in Proc. AAAI, vol. 6, 2006, pp. 755–762.
[3] Y. Liu and Y.-F.-B. Wu, ‘‘FNED: A deep network for fake news early [29] S. Mohammad, S. Kiritchenko, P. Sobhani, X. Zhu, and C. Cherry,
detection on social media,’’ ACM Trans. Inf. Syst., vol. 38, no. 3, pp. 1–33, ‘‘SemEval-2016 task 6: Detecting stance in tweets,’’ in Proc. 10th Int.
Jul. 2020. Workshop Semantic Eval. (SemEval). San Diego, CA, USA: Association
[4] J. C. S. Reis, A. Correia, F. Murai, A. Veloso, and F. Benevenuto, for Computational Linguistics, 2016, pp. 31–41.
‘‘Supervised learning for fake news detection,’’ IEEE Intell. Syst., vol. 34, [30] B. G. Patra, D. Das, and S. Bandyopadhyay, ‘‘JU_NLP at SemEval-2016
no. 2, pp. 76–81, Mar. 2019. task 6: Detecting stance in tweets using support vector machines,’’ in Proc.
10th Int. Workshop Semantic Eval. (SemEval), 2016, pp. 440–444.
[5] M. Z. Asghar, A. Habib, A. Habib, A. Khan, R. Ali, and A. Khattak,
‘‘Exploring deep neural networks for rumor detection,’’ J. Ambient Intell. [31] H. Elfardy and M. Diab, ‘‘CU-GWU perspective at SemEval-2016 task 6:
Hum. Comput., vol. 12, no. 4, pp. 4315–4333, Apr. 2021. Ideological stance detection in informal text,’’ in Proc. 10th Int. Workshop
Semantic Eval. (SemEval), 2016, pp. 434–439.
[6] R. K. Kaliyar, A. Goswami, and P. Narang, ‘‘DeepFakE: Improving fake
news detection using tensor decomposition-based deep neural network,’’ [32] I. Augenstein, T. Rocktäschel, A. Vlachos, and K. Bontcheva,
J. Supercomput., vol. 77, no. 2, pp. 1015–1037, Feb. 2021. ‘‘Stance detection with bidirectional conditional encoding,’’ 2016,
arXiv:1606.05464.
[7] S. S. Jadhav and S. D. Thepade, ‘‘Fake news identification and classifica-
[33] P. Wei, W. Mao, and D. Zeng, ‘‘A target-guided neural memory model
tion using DSSM and improved recurrent neural network classifier,’’ Appl.
for stance detection in Twitter,’’ in Proc. Int. Joint Conf. Neural Netw.
Artif. Intell., vol. 33, no. 12, pp. 1058–1068, Oct. 2019.
(IJCNN), Jul. 2018, pp. 1–8.
[8] A. Vereshchaka, S. Cosimini, and W. Dong, ‘‘Analyzing and distinguishing
[34] S. Zhou, J. Lin, L. Tan, and X. Liu, ‘‘Condensed convolution neural
fake and real news to mitigate the problem of disinformation,’’ Comput.
network by attention over self-attention for stance detection in Twitter,’’
Math. Org. Theory, vol. 26, no. 3, pp. 350–364, Sep. 2020.
in Proc. Int. Joint Conf. Neural Netw. (IJCNN), Jul. 2019, pp. 1–8.
[9] F. Monti, F. Frasca, D. Eynard, D. Mannion, and M. M. Bronstein, ‘‘Fake
[35] M. Taulé, M. A. Martí, F. M. Rangel, P. Rosso, C. Bosco, and V. Patti,
news detection on social media using geometric deep learning,’’ 2019,
‘‘Overview of the task on stance and gender detection in tweets on Catalan
arXiv:1902.06673.
independence at IberEval 2017,’’ in Proc. 2nd Workshop Eval. Hum. Lang.
[10] M. H. Goldani, S. Momtazi, and R. Safabakhsh, ‘‘Detecting fake news with Technol. Iberian Lang. (CEUR-WS), vol. 1881, 2017, pp. 157–177.
capsule neural networks,’’ 2020, arXiv:2002.01030.
[36] M. Lai, A. T. Cignarella, D. I. Hernández Farías, C. Bosco, V. Patti, and
[11] S. Shellenbarger, ‘‘Most students don’t know when news is fake, Stanford P. Rosso, ‘‘Multilingual stance detection in social media political debates,’’
study finds,’’ Wall Street J., vol. 21, 2016. Comput. Speech Lang., vol. 63, Sep. 2020, Art. no. 101075.
[12] D. Pierson, ‘‘Facebook and Google pledged to stop fake news. So why did [37] S. Sommariva, C. Vamos, A. Mantzarlis, L. U.-L. Dào, and D. M. Tyson,
they promote Las Vegas-shooting hoaxes?’’ Los Angeles Times, Oct. 2017. ‘‘Spreading the (fake) news: Exploring health messages on social media
[13] G. Zarrella and A. Marsh, ‘‘MITRE at SemEval-2016 task 6: Transfer and the implications for health professionals using a case study,’’ Amer. J.
learning for stance detection,’’ 2016, arXiv:1606.03784. Health Educ., vol. 49, no. 4, pp. 246–255, Jul. 2018.
[14] S. Ghosh, P. Singhania, S. Singh, K. Rudra, and S. Ghosh, ‘‘Stance [38] B. Riedel, I. Augenstein, G. P. Spithourakis, and S. Riedel, ‘‘A simple but
detection in web and social media: A comparative study,’’ in Proc. Int. tough-to-beat baseline for the Fake News Challenge stance detection task,’’
Conf. Cross-Lang. Eval. Forum Eur. Lang. Cham, Switzerland: Springer, pp. 1–6, May 2018, arXiv:1707.03264.
2019, pp. 75–87. [39] Q. Zhang, S. Liang, A. Lipani, Z. Ren, and E. Yilmaz, ‘‘From stances’
[15] A. I. Al-Ghadir, A. M. Azmi, and A. Hussain, ‘‘A novel approach to stance imbalance to their hierarchical representation and detection,’’ in Proc.
detection in social media tweets by fusing ranked lists and sentiments,’’ World Wide Web Conf., May 2019, pp. 2323–2332.
Inf. Fusion, vol. 67, pp. 29–40, Mar. 2021. [40] C. Dulhanty, J. L. Deglint, I. B. Daya, and A. Wong, ‘‘Taking a stance
[16] S. Somasundaran and J. Wiebe, ‘‘Recognizing stances in ideological on- on fake news: Towards automatic disinformation assessment via deep
line debates,’’ in Proc. NAACL HLT Workshop Comput. Approaches Anal. bidirectional transformer language models for stance detection,’’ 2019,
Gener. Emotion Text, 2010, pp. 116–124. arXiv:1911.11951.
[17] A. Konjengbam, S. Ghosh, N. Kumar, and M. Singh, ‘‘Debate stance [41] B. Pouliquen, R. Steinberger, and C. Best, ‘‘Automatic detection of
classification using word embeddings,’’ in Proc. Int. Conf. Big Data Anal. quotations in multilingual news,’’ in Proc. Recent Adv. Natural Lang.
Knowl. Discovery. Cham, Switzerland: Springer, 2018, pp. 382–395. Process., 2007, pp. 487–492.
[18] A. Faulkner, ‘‘Automated classification of stance in student essays: An [42] M.-C. De Marneffe, A. N. Rafferty, and C. D. Manning, ‘‘Finding
approach using stance target information and the Wikipedia link-based contradictions in text,’’ in Proc. Assoc. Comput. Linguistics, 2008,
measure,’’ in Proc. 27th Int. Flairs Conf., May 2014. pp. 1039–1047.




ALAA ALTHENEYAN received the B.Ed. degree in computer and education and the M.S. and Ph.D. degrees in computer science from King Saud University, Riyadh, Saudi Arabia, in 2006, 2012, and 2020, respectively. From 2011 to 2021, she was a Lecturer with King Saud University, where she has been an Assistant Professor with the Computer and Engineering Department, since 2021. Her research interests include natural language processing and machine learning.

ASEEL ALHADLAQ received the B.S. and M.S. degrees in computer science from King Saud University, Riyadh, Saudi Arabia, in 2006 and 2013, respectively, and the Ph.D. degree in computing from Newcastle University, Newcastle upon Tyne, U.K., in 2021. From 2011 to 2021, she was a Lecturer with King Saud University, where she has been an Assistant Professor with the Computer and Engineering Department, since 2021. Her research interests include human–computer interaction, social media, and designs.

