Big Data ML-Based Fake News Detection Using Distributed Learning
ABSTRACT Users rely heavily on social media to consume and share news, facilitating the mass dissemination of genuine and fake stories. The proliferation of misinformation on various social media platforms has serious consequences for society. The inability to differentiate between the several forms of false news on Twitter is a major obstacle to the effective detection of fake news. Researchers have made progress toward a solution by emphasizing methods for identifying fake news. This study uses the FNC-1 dataset, which includes four categories for identifying false news. The state-of-the-art methods for spotting fake news are evaluated and compared using big data technology (Spark) and machine learning. The methodology of this study employed a distributed Spark cluster to create a stacked ensemble model. Following feature extraction using N-grams, HashingTF-IDF, and a count vectorizer, we applied the proposed stacked ensemble classification model. The results show that the proposed model achieves a superior classification performance of 92.45% F1 score compared to the 83.10% F1 score of the baseline approach. The proposed model thus achieved an additional 9.35% F1 score compared to the state-of-the-art techniques.
INDEX TERMS Big data, machine learning, fake news, ensemble learning, social media.
I. INTRODUCTION
The use of social media platforms to disseminate and consume media has increased in recent years. Social networking sites like Facebook and Twitter generate vast amounts of data daily [1]. It is no secret that the internet is a goldmine of information, especially recent news [2]. The proliferation of fake news is directly attributable to the internet's user-friendly nature. Since fake news is often presented as factual, it is often shared on social media. Often, this data is spread for profit or to influence politics. The effects of fake news on society as a whole are profound. In light of these profound impacts, fixing this issue is crucial [3]. Multiple instances of false news were reported to have spread on social media during the 2016 US elections, including the presidential election and the nomination of a new Air Marshal in India [4]. The dissemination of false information has negatively affected people's mental health and society as a whole [5].

Many automatically assume that the news is either bogus or legitimate based on the article's content. Techniques based on news content use methods for collecting data and tone from fake news stories. The goal of style-based methods for detecting false news is to utilize the manipulators' writing styles for detection. By examining certain language features, we can distinguish fake news from the real thing [3]. However, false news is created with the intent of fooling readers. Thus, improving the detection of false news using news content style is a difficult problem.

To assist in avoiding the difficult and time-consuming human work of fact-checking, the Natural Language Processing (NLP) community has shown considerable interest in automatic recognition of fake news [6], [7]. Determining the integrity of news is a difficult task, even for automated approaches [8]. Familiarizing with what other news outlets say on the same issue might be a useful starting point for recognizing false news. Identifying a person's position is the purpose of this phase. Multiple tasks, such as evaluating online arguments [9], [10], verifying the integrity of Twitter rumors [11], [12],

The associate editor coordinating the review of this manuscript and approving it for publication was Chong Leong Gan.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
VOLUME 11, 2023 29447
A. Altheneyan, A. Alhadlaq: Big Data ML-Based Fake News Detection Using Distributed Learning
FIGURE 1. Overview of the headline and text bodies with their respective stances.
or understanding the argumentation structure of seminal works [13], [14], have traditionally relied on stance identification.

The first Fake News Challenge (FNC-1) was organized to create automated fake news detection systems using AI technology and machine learning. Almost fifty groups from industry and academia worked on this problem. One of the objectives of the FNC-1 challenge is to track how a media production deals with a certain title: it might support, challenge, or have nothing to do with the title. These are the potential vantage points from which an article may be written. The guidelines, dataset, and grading criteria for the FNC-1 challenge are all available on its site. The stances are further shown in Figure 1, which depicts examples of the four distinct stances.

Multiple deep learning and Recurrent Neural Networks (RNN), as well as their modifications, including Convolutional Neural Networks (CNN) [15], are often employed for NLP tasks and have been shown to perform magnificently on NLP-related tasks [16], [17], [18].

A. OVERVIEW OF FAKE NEWS DETECTION
In 2017, Facebook released a white paper that explored the risks of online communication and the management of being one of the most prominent social media platforms today. Weedon, Nuland, and Stamos also noticed the growing challenge of using the enigmatic phrase ''fake news,'' and proclaimed that ''the overuse and misapplication of the term ''fake news'' might be challenging since we cannot understand or adequately address these concerns without shared definitions'' [19]. The word can apply to anything from virtually incorrect news articles to deceptions, April Fools' jokes, rumors, clickbait, or stated opinions posted online with incorrect facts.

In this research work, ''fake news'' is defined as a written article that is manifestly untrue and falsely disseminated without being authentic, mostly accompanied by malicious intent. This definition includes three important bases: textual, visual, and audio. Other elements, such as video-based and audio-based fake news, are typically ignored when referring to textual fake news; additionally, each element has its linguistic complexities that necessitate different machine learning and deep learning algorithms to detect and solve problems such as ''Deep Fake,'' etc. The notion also implies that fake news might be fact-checked, an important characteristic. Therefore, the claims may be checked to see if they are true or false. Because rumors are usually hard to verify, they are excluded from the definition. Conspiracy theories are classed as rumors because they are persistent rumors that are difficult to refute. False information concerning the entertainment sector, including hoaxes and April Fools' gags, is not included because the objective must be harmful. Furthermore, the goal is infamous as it seeks to affect public opinion in favor of a specific message. The definition also excludes text bits that were mistakenly published improperly, such as transposed numbers.

A model of the connection between headlines and news content is necessary for identifying clickbait. It is also crucial to tell the difference between false news and clickbait. The term ''clickbait'' refers to articles with enticing headlines written to attract online audience or traffic; when people click on such a headline, they end up at a different website with poorly written articles that have nothing to do with the subject line. So, clickbait is written with one goal: getting more people to visit a website that
makes money. The motive is monetary gain rather than furthering a political agenda via disseminating false information.

A great example is the deliberate spreading of false news about Hillary Clinton by Russian trolls in the 2016 presidential election campaign, which was designed to sway people's voting choices away from Hillary and toward Donald Trump. This instance demonstrates how dangerous it can be when false information spreads on critical issues. Of course, there is another problem with false news: toxic information is spread for no reason other than to sow doubt, stir up chaos, and make it difficult for readers to tell fact from fiction. To detect and prevent the spread of false news, it is necessary to conduct research into their online behavior.
• Source: where the news or a piece of news comes from, who published it, and whether the source is authentic.
• Headline: a detailed summary of the news, written to entice readers.
• Body Text: the actual story/content of the news.
The most common method for detecting false information is to look at the content of the news piece. The substance of a news report is generally separated into two types: textual and visual. Much of the news material is presented in the textual mode, one of these modalities. As previously said, fake news aims at manipulating the audience, and it does so via the use of specific terminology. Non-fake news, however, is usually transferred to a separate language list since it is more legitimate. Attribute-based language characteristics and structure-related language features are two common categories.

2) ATTRIBUTE-BASED LANGUAGE FEATURES
They involve the ten parallel aspects of content style's linguistic elements. These aspects involve volume, uncertainty, objectivity, emotions, diversity, and readability [24]. Although attribute-based language characteristics are generally extremely important, explainable, and predictable, they are often useless in assessing deception style compared to structure-based features. Furthermore, attributed features require extra resources for deception detection, which may take longer and significantly focus on correct feature evaluation and filtering.

3) STRUCTURE-BASED LANGUAGE FEATURES
Content style is defined by structure-based linguistic properties and must have four levels of language: the first is lexicon, the second is semantics, then discourse and syntax. Structure-related features are also technique-oriented features because most quantification depends on NLP-based methods. The critical challenge at the lexical level is identifying the frequency statistics of a word(s), letter(s), or other entity, which may be done correctly by applying n-gram models. Part-of-Speech (POS) taggers execute shallow syntax tasks at the syntax level, making tagging and assessment of POS easier. Probabilistic Context-Free Grammars (PCFG) analyse Context-Free Grammars (CFG) by performing deep syntax-level operations with parse trees. On the semantic level, word count (WC) and linguistic inquiry are also utilized to create semantic classes for semantic features.

E. PROBLEM FORMULATION
Developing a Spark distributed cluster-based environment for efficiently detecting fake news articles via a supervised learning paradigm necessitated solving two sub-problems. First, our model needed to learn how to recognize and seize necessary information in lengthy textual news articles for categorizing the association between news item titles and related meta descriptions.

F. RESEARCH OBJECTIVES
In the first section of this research, we examine the effectiveness of Recurrent Neural Networks (RNN) in modeling news articles to identify the link between an article's body content and its title. As part of our research, we use the dataset made available for the FNC-1 competition to train and assess a classifier. We want the classifier to be able to do the following.
• Use the Spark framework to research, assess, and compare several machine learning classification techniques on four classes from the FNC-1 dataset.
• Given a title and an article, determine if the article agrees with, disagrees with, discusses, or is irrelevant to the assertion made in the headline.
• Propose an efficient, systematic, and functional approach based on machine learning algorithms for detecting fake news using Spark, and design an efficient stacked ensemble classifier for fake news detection.
In an experiment, we demonstrate that the recommended method can accurately identify fake news and beats current state-of-the-art algorithms.

G. PAPER LAYOUT
The remaining paper contains the following sections. Related work is reviewed in section II. The dataset used for
experimentation and preliminaries is discussed in section III. The experimental results and discussion are articulated in section IV. Finally, section V presents the conclusion and future work.

II. LITERATURE REVIEW
This section provides an overview of the difficulties previous research has faced in identifying fake news. To identify fabricated news stories, it is necessary to do rumor detection and identification. It is important to distinguish between real and fake news since both are based on deliberate fabrication. Fake news identification is particularly difficult when detecting news based on characteristics. Tweets and social context can be used to generate features. As a result, we assess prior work based on single-modality and stance identification.

A. TEXTUAL CONTENT BASED
Most earlier news identification studies relied mainly on textual elements and user metadata. Text-based features are statistically extracted from message text content and have been extensively discussed in the literature on fake news identification. The textual component extracts unique writing styles [15], [19], [20] and emotional sensations [18] that are prominent in fake news.

Network connections, style analysis, and individual emotions have all been proven to contribute to detecting fake news [19]. After reading these posts, [20] explored the writing style and its effects on readers' viewpoints and attitudes. Emotion is a significant predictor in many fake news detection studies, and most rely on user positions or simple statistical emotional features to convey emotion. In [15], the authors introduced a novel dual emotion-based method for identifying fake news that can learn from publishers' and users' content, user comments, and emotional representation. Reference [25] employed an ML model for identifying fake news that employs convolution filters to distinguish between different granularities of text information. They investigated the issue of stance categorization in an innovative approach to consumer health information inquiries and achieved 84% accuracy using the SVM model.

B. SOCIAL CONTEXT BASED
User-generated social media interactions with news stories may give additional information, in addition to aspects directly relevant to the substance of the stories. In [26], the authors proposed a novel approach employing a knowledge graph to identify fake news based on actual content. A graph-kernel-based approach was used by [27] to discover propagation patterns and attitudes. On the other hand, social context features are difficult to gather because they are noisy, unstructured, and time-consuming to collect [28].

C. STANCE DETECTION OVERVIEW
From a broad viewpoint, stance detection can be elaborated as the problem of determining an author's or text's point of view concerning a specified target, such as a single topic, headline, or even a person [15], [29]. Consequently, there are three factors and a machine learning based categorization technique to determine how the comparison occurs. The group's titles (for example: help, against, for, or neutral) are determined by the issue. Political arguments [30], [31], articles [32], [33], and even internal company dialogues [25], [34], which stretch across a wide range of fields, may be referred to as categories. Detecting the stance of Tweets or short texts such as hearsay [35] or microblogging accounts has gotten much attention in opinion mining. ''Hillary Clinton'' as a celebrity, ''Atheism'' as a specific issue, or the claim that ''E-cigarettes are safer than regular cigarettes'' are examples of targets presented in the available datasets. Shared tasks for providing such datasets and promoting research have emerged in several languages.

The sub-task for exposing stance in Tweets [26] was presented at SemEval-2016, with roughly 5,000 tweets in English, including five familiar subjects. The task has initiated a variety of approaches, including conventional techniques (for example, KNN [36], SVM [22], or essential attributes given by methods [34]) and deep learning approaches (e.g., BiLSTM [37], Bidirectional Conditional Encoding [27], [34]). Furthermore, there are public datasets, for instance, the Multi-Perspective Consumer Health Query dataset [38], dedicated to exposing the stance of sentences taken from high-quality articles on five separate assertions, like ''Sun exposure causes skin cancer.'' The dataset is available to work on the development of new and exciting work. It contains an in-depth examination of various approaches to the two goals listed above. The need for well-interpreted data in languages other than English has rapidly increased annotation efforts and collaborative tasks aimed at furthering research. There are efforts like Stance-Cat, a task for identifying attitudes in Spanish and Catalan tweets [39], a proposal and database of brief statements in Russian online forums [40], and even projects that integrate several languages [41].

A group of volunteers from industry and academia launched the Fake News Challenge in December 2016 [10]. Using Machine Learning, Natural Language Processing (NLP), and Artificial Intelligence (AI), this competition aimed to encourage the development of technologies that could assist human fact-checkers in detecting deliberate deception in news reporting. As a first step, the organizers decided to research what other media outlets have to say about the topic. Consequently, they decided to introduce the event with a stance detection challenge in the first round of competition. The organizers collected data on headlines and body text before the event. In the competition, they asked participants to create classifiers that could reliably classify a body text's viewpoint on a given headline into one of four categories: ''disagree'', ''agree'', ''discuss'' or ''unrelated''. On this task's test set, the top three teams achieved accuracy rates greater than or equal to 80%. The top team's model combined Gradient Boosted Decision
A. DATASET
Carnegie Mellon University adjunct professor Dean Pomerleau, Joostware, and AI Research Corporation founder Delip Rao hosted a competition called the Fake News Challenge Stage 1 (FNC-1) to investigate the potential of machine learning and natural language processing in the fight against fake news [27]. This issue was the driving force for the competition, which focused on stance detection. This section provides an overview of the competition dataset, the baseline used by the FNC-1 organisers, and the winning strategies used throughout the competition.

The dataset was created by turning a news story into a headline and then annotating the headline, using the story to show where it stood on the assertion it introduced. For this stance categorization exercise, there are three possible labels: ''for,'' ''against,'' and ''observing.'' The emerging dataset [27] is the basis for the FNC-1 competition dataset. To create the FNC-1 dataset, headlines and articles from the emerging dataset are randomly matched depending on their attitude toward the linked allegation. First, the headlines and articles are separated into related and unrelated groups. Second, and more difficult, the collection of connected headline-article pairings is further split into the three classes disagree, agree, and discuss, allowing for supervision of the job of evaluating the attitude of an article relative to the assertion presented in the associated headline. There are 49,972 headline-article pairs in the training set of the FNC-1 dataset, and another set of pairs in the test set. There are 1,689 distinct headlines and 1,648 unique articles used to build the headline-article pairings that make up the training set. The test set includes 904 distinct articles and 894 unique headlines. In the training set, 73 percent of pairs are classified as unrelated, 7.4 percent as agreeing, 1.7 percent as disagreeing, and 17.8 percent as discussing. About 72.2 percent of the test data is unrelated; 7.4 percent is in agreement; 2.7 percent is in disagreement; and 17.6 percent is discuss. The training set has 40,350 headline-article pairs, the hold-out set has 9,622, and the test set has 25,413.

B. CORPUS DESIGN
The FNC-1 dataset has four distinct classes (agree, disagree, discuss, unrelated). In pre-processing, labels are encoded into numeric target values and several pre-processing steps are performed. Preprocessed data is split into 75% for training and 25% for testing. This study used the FNC-1 dataset, consisting of two CSV files, including the stances and body corpora of text news stories written in English.
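As a rough sketch of this step (pure Python; the label map and split routine are illustrative stand-ins, not the paper's actual code), the four stance labels can be encoded to numeric targets and the data split 75/25:

```python
import random

# Hypothetical numeric encoding for the four FNC-1 stance classes.
LABELS = {"agree": 0, "disagree": 1, "discuss": 2, "unrelated": 3}

def encode_and_split(pairs, train_frac=0.75, seed=42):
    """Encode string stance labels to integers and split 75%/25%."""
    encoded = [(headline, body, LABELS[stance]) for headline, body, stance in pairs]
    rng = random.Random(seed)
    rng.shuffle(encoded)              # shuffle before splitting
    cut = int(len(encoded) * train_frac)
    return encoded[:cut], encoded[cut:]

# Toy corpus of 400 headline/body pairs, 100 per stance class.
data = [("h%d" % i, "b%d" % i, s) for i in range(100)
        for s in ("agree", "disagree", "discuss", "unrelated")]
train, test = encode_and_split(data)
```

A fixed seed keeps the split reproducible across runs, which matters when comparing classifiers on the same partitions.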
C. PRE-PROCESSING
Data mining relies heavily on pre-processing. It converts inconsistent and incomplete raw data into a machine-readable representation. Various text preprocessing activities were conducted on the FNC-1 dataset. To complete these tasks, NLP approaches such as conversion of characters to lowercase, stop word elimination, stemming, and tokenization, as well as algorithms from the Keras library, were used. Stop words, which comprise words like ''the, of, there,'' etc., are the most commonly used words in our daily language and typically have relatively limited significance in terms of the entire context of the phrase. By removing the stop words, we save the time and space that would otherwise be consumed by these low-information words. Words with comparable meanings may appear in the text many times; for example, ''eating'' in any sentence will become ''eat.'' Reducing words to their most basic form can help in that case. This operation, known as stemming [51], uses an open-source version of the NLTK's Porter stemmer method. The preprocessing steps are as follows:
1) Stop Word Removal: Languages commonly use a group of
terms collectively known as ''stop words.'' The words ''a,''
''the,'' ''is,'' and ''are'' are all examples of stop words in
English. Stop words are commonly removed in text mining and
natural language processing (NLP) to weed out overused words
that contain little useful information. NLTK provides the stop
word dictionary used in this instance. To begin, the text is
cleaned up by removing all stop words. Stop words can be
removed from the text because they are common and carry
little useful information; examples include the conjunctions
'and', 'or', and 'but'. This pre-processing step is essential
in natural language processing because processing these
frequently occurring but uninformative words consumes a
significant amount of time.
2) Punctuation Removal: The grammatical context of a
sentence is provided by natural language punctuation.
A comma, for example, may not add anything to the
understanding of the statement.
3) Link Removal: This step removes hypertext links
from social media posts. Regular expressions are used
to do this.
4) Lemmatization or stemming: Either lemmatization
or stemming is done during this step. The NLTK’s
WordNet Lemmatizer is used for lemmatization,
while the NLTK’s Snowball Stemmer
implementation is used for stemming, based on the
Porter2 stemming algorithm [52].
5) Reply removal: Apart from the above-mentioned pre-
processing stages, every social media post must also go
through reply removal. Words beginning with @ (primarily
used for Twitter replies) are eliminated in this phase.
Regular expressions are also used to do this.
6) Lowercase transformation: Every word is
converted to lowercase in this phase to account for
variances in capitalization.
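The steps above can be sketched in plain Python (an illustrative approximation: the stop-word list is tiny, and the crude suffix rule merely stands in for the NLTK Porter/Snowball stemmers used in the actual pipeline):

```python
import re

# Small illustrative stop-word list; the real pipeline uses NLTK's dictionary.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "and", "or", "but", "there"}

def preprocess(post):
    text = post.lower()                          # 6) lowercase transformation
    text = re.sub(r"https?://\S+", "", text)     # 3) link removal
    text = re.sub(r"@\w+", "", text)             # 5) reply removal
    text = re.sub(r"[^\w\s]", " ", text)         # 2) punctuation removal
    tokens = [t for t in text.split() if t not in STOP_WORDS]  # 1) stop words
    # 4) crude suffix stripping, standing in for a real stemmer
    return [re.sub(r"(ing|ed|s)$", "", t) if len(t) > 4 else t for t in tokens]

print(preprocess("Eating @user the fake news, see http://x.co!"))
```

The ordering matters: links and replies are stripped before punctuation removal, otherwise the `@` and `://` markers the regexes rely on would already be gone.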
D. FEATURE EXTRACTION
Feature extraction transforms raw data into numerical features that can be further processed while preserving the information in the original data set. It is more effective than training a model directly on raw data.
1) HASHINGTF
HashingTF transforms a set of terms into feature vectors of fixed length. Regarding text processing, a ''term set'' could be a collection of words. HashingTF maps a series of terms to their word frequencies using the hashing method: a hash function transforms a raw feature (term) into an index. The hash function in use here is Austin Appleby's MurmurHash3 algorithm (MurmurHash3_x86_32). Since the hash code is translated to a column index using a simple modulo, the features would not be evenly mapped to the columns if the numFeatures input were not a power of two. The mapped indices are then used to calculate the term frequencies. By bypassing the need for a term-to-index map, which can be time-consuming and expensive for large corpora, this method saves cost but is susceptible to hash collisions [45], where multiple raw features are hashed into the same term.
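A minimal sketch of the hashing trick described here (pure Python; Python's built-in `hash` stands in for MurmurHash3, and `num_features` is kept small for illustration):

```python
# Toy stand-in for Spark's HashingTF: hash each term, take the hash code
# modulo numFeatures (ideally a power of two), and count occurrences.
def hashing_tf(terms, num_features=16):
    vec = [0] * num_features
    for term in terms:
        # Python's hash() stands in for the MurmurHash3 used by Spark.
        idx = hash(term) % num_features
        vec[idx] += 1        # collisions add counts to the same bucket
    return vec

vec = hashing_tf(["fake", "news", "fake"])
```

Note that no term-to-index dictionary is ever built, which is exactly what makes the method cheap on large corpora, at the price of possible collisions.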
2) IDF
Inverse Document Frequency (IDF) is a calculation frequently employed in association with term frequency. The issue with term frequency is that frequent terms are not necessarily the most significant. For example, ''content'' will appear on every web page. IDF is a method for lowering the weight of frequently occurring words in a corpus (collection of documents). IDF is determined by dividing the total number of documents by the number of documents containing the term, typically on a logarithmic scale. IDF is an Estimator that generates an IDFModel after being fitted to a dataset. Feature vectors (typically created by HashingTF or CountVectorizer) are scaled by the IDF model's per-feature weights [46]. It thus down-weights features that are common across a corpus.
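The calculation can be sketched as follows (a toy version assuming Spark MLlib's smoothed formula idf = log((N + 1) / (df + 1)), where N is the corpus size and df the document frequency; this is not the paper's code):

```python
import math

def idf(corpus):
    """Compute smoothed IDF weights for every term in a tokenized corpus."""
    n = len(corpus)
    df = {}
    for doc in corpus:
        for term in set(doc):          # count each term once per document
            df[term] = df.get(term, 0) + 1
    return {t: math.log((n + 1) / (c + 1)) for t, c in df.items()}

docs = [["fake", "news"], ["real", "news"], ["news"]]
weights = idf(docs)
```

A term occurring in every document ("news" above) receives weight log(1) = 0, so it is effectively ignored, while rarer terms keep positive weight.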
F. EVALUATION METRICS
The main concern is determining the model's ability to discern true and false news. We used several metrics to properly examine the model's efficiency for this difficult challenge. Model selection and implementation are essential but should not take precedence over the rest of the project. Various assessment measures are applied to test data to assess the model's capacity to detect false news. Multiple evaluation metrics, such as classification reports (accuracy, precision, recall, F1-score) and confusion matrices, may be used to assess machine learning models. The sections that follow go through each of the assessment measures in detail. Pre-processed fake news data is loaded into a strong algorithm, producing good results [49].

Observations that match the predictions made by the model are true positives and true negatives, respectively. The two kinds of mismatches, false positives and false negatives, are the errors we would want to minimize. These terms can be confusing at first, so we can check our understanding by dissecting each one.
A True Positive (TP) is a correctly anticipated positive result, when both the actual and predicted class values are yes. For instance, if the expected and actual class values suggest that the passenger survived, we know they did. When both the actual and predicted class values are negative, we say the value is a True Negative (TN). For instance, this passenger did not survive if both the actual and predicted class values show that they did not. When the actual class value differs from the predicted class value, false positives and false negatives occur. When the predicted class is yes but the real class is no, this is called a False Positive (FP); for example, the actual class shows that the passenger did not survive, but the forecast class predicts that the passenger would. In cases when the true class is yes but the predicted class is no, a False Negative (FN) has occurred; for example, the actual class value reveals that the passenger survived, whereas the predicted class value said they would die.
To verify the usefulness of the model, the following assessment criteria are used:

Accuracy is the proportion of test results that were predicted correctly. It is calculated by dividing the number of correct predictions by the total number of predictions.

Acc = (TP + TN) / (TP + FP + TN + FN)

Precision: To calculate a classifier's precision, divide the number of true positives by the total number of positive predictions.

Pr = TP / (TP + FP)

Recall: The number of true positives divided by the total number of actual positive instances is used to determine recall.

Re = TP / (TP + FN)

F1-score: It is a great way to test precision and recall simultaneously, as it is their harmonic mean.

F1 = 2 × (Precision × Recall) / (Precision + Recall)
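These four measures can be computed with a small function over raw confusion counts (the counts below are illustrative values, not results from the paper):

```python
def metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    acc = (tp + tn) / (tp + fp + tn + fn)   # correct over all predictions
    pr = tp / (tp + fp)                     # correct over predicted positives
    re = tp / (tp + fn)                     # correct over actual positives
    f1 = 2 * (pr * re) / (pr + re)          # harmonic mean of pr and re
    return acc, pr, re, f1

# Example counts for a single class of an imbalanced problem.
acc, pr, re, f1 = metrics(tp=8, fp=2, tn=85, fn=5)
```

With these counts accuracy is high (0.93) while F1 is much lower (about 0.70), which is why F1 is the headline metric on a skewed dataset like FNC-1.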
The accuracy of a prediction may be measured with the use of a classification report (CR). Correct and incorrect classifications for each category are used to determine the totals. True positive (TP), false positive (FP), true negative (TN), and false negative (FN) counts are widely used in the classification report's construction. Several metrics may be used to evaluate a model's efficacy, but accuracy is often prioritized. The report incorporates a wide range of assessment measures, including accuracy, precision, recall, F1-score, and support. ''Support'' refers to the number of occurrences of each class in a given dataset [50]. Precision represents how much of the predicted information is correct, recall represents how much of the relevant information is retrieved, and the F1-score combines precision and recall into a single value. The report summarizes the weighted mean recall and precision for a given sample. ''Accuracy'' refers to the proportion of correct predictions relative to the total number of predictions.
C. DISCUSSION
The FNC-1 dataset, which contains 49,972 headline articles
and four distinct categories, was used to achieve the inves-
tigation’s objectives, and obtain the desired results (discuss,
agree, unrelated, and disagree). The proposed system com-
prises numerous components, such as data pre-processing,
visualization, exploratory analysis, feature extraction, and
classification using machine learning strategies. We proposed
classifying the data with a stacked ensemble model. Spark’s
training time was excessive because the experiments were run
on a standalone cluster. In future work, researchers can
perform experiments by creating a cluster across separate
machines. This research may also be extended by employing
various neural network-based models better suited to
unsupervised fake news identification.
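The feature-extraction stage of the proposed pipeline (word N-grams hashed into a fixed-size term-frequency vector and reweighted by inverse document frequency, as in Spark MLlib's HashingTF and IDF) can be sketched in plain Python. This is a single-machine approximation for illustration, not the paper's distributed Spark implementation; the hash function, vector size, and tokenizer are our assumptions:

```python
import hashlib
import math
import re

def word_ngrams(text, n=2):
    """Lowercase, tokenize, and emit contiguous word n-grams."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def hashing_tf(text, n=2, num_features=1024):
    """Hash each n-gram into one of num_features buckets and count
    term frequencies, analogous to Spark MLlib's HashingTF."""
    vec = [0.0] * num_features
    for gram in word_ngrams(text, n):
        idx = int(hashlib.md5(gram.encode()).hexdigest(), 16) % num_features
        vec[idx] += 1.0
    return vec

def idf_weights(corpus_vecs):
    """Smoothed inverse document frequency per hashed feature:
    idf = log((N + 1) / (df + 1)) + 1."""
    n_docs = len(corpus_vecs)
    num_features = len(corpus_vecs[0])
    df = [sum(1 for v in corpus_vecs if v[j] > 0)
          for j in range(num_features)]
    return [math.log((n_docs + 1) / (d + 1)) + 1 for d in df]
```

The resulting TF-IDF vector for a document is the element-wise product of its hashed TF vector and the corpus IDF weights; these fixed-length vectors then feed the stacked ensemble classifiers.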
ACKNOWLEDGMENT
We would like to thank Researchers Supporting Project
number (RSPD2023R532) King Saud University, Riyadh,
Saudi Arabia.
REFERENCES
[1] P. H. A. Faustini and T. F. Covões, ‘‘Fake news detection in multiple
platforms and languages,’’ Expert Syst. Appl., vol. 158, Nov. 2020,
Art. no. 113503.
[2] M. D. Vicario, W. Quattrociocchi, A. Scala, and F. Zollo, ‘‘Polarization
and fake news: Early warning of potential misinformation targets,’’ ACM
Trans. Web, vol. 13, no. 2, pp. 1–22, May 2019.
[3] Y. Liu and Y.-F.-B. Wu, ‘‘FNED: A deep network for fake news early
detection on social media,’’ ACM Trans. Inf. Syst., vol. 38, no. 3, pp. 1–
33, Jul. 2020.
[4] J. C. S. Reis, A. Correia, F. Murai, A. Veloso, and F. Benevenuto,
‘‘Supervised learning for fake news detection,’’ IEEE Intell. Syst., vol.
34, no. 2, pp. 76–81, Mar. 2019.
[5] M. Z. Asghar, A. Habib, A. Habib, A. Khan, R. Ali, and A. Khattak,
‘‘Exploring deep neural networks for rumor detection,’’ J. Ambient Intell.
Hum. Comput., vol. 12, no. 4, pp. 4315–4333, Apr. 2021.
[6] R. K. Kaliyar, A. Goswami, and P. Narang, ‘‘DeepFakE: Improving fake
news detection using tensor decomposition-based deep neural network,’’
J. Supercomput., vol. 77, no. 2, pp. 1015–1037, Feb. 2021.
[7] S. S. Jadhav and S. D. Thepade, ‘‘Fake news identification and classifica-
tion using DSSM and improved recurrent neural network classifier,’’
Appl. Artif. Intell., vol. 33, no. 12, pp. 1058–1068, Oct. 2019.
[8] A. Vereshchaka, S. Cosimini, and W. Dong, ‘‘Analyzing and distinguishing
fake and real news to mitigate the problem of disinformation,’’ Comput.
Math. Org. Theory, vol. 26, no. 3, pp. 350–364, Sep. 2020.
[9] F. Monti, F. Frasca, D. Eynard, D. Mannion, and M. M. Bronstein, ‘‘Fake
news detection on social media using geometric deep learning,’’ 2019,
arXiv:1902.06673.
[10] M. H. Goldani, S. Momtazi, and R. Safabakhsh, ‘‘Detecting fake news
with capsule neural networks,’’ 2020, arXiv:2002.01030.
[11] S. Shellenbarger, ‘‘Most students don’t know when news is fake,
Stanford study finds,’’ Wall Street J., Nov. 2016.
[12] D. Pierson, ‘‘Facebook and Google pledged to stop fake news. So why
did they promote Las Vegas-shooting hoaxes?’’ Los Angeles Times, Oct.
2017.
[13] G. Zarrella and A. Marsh, ‘‘MITRE at SemEval-2016 task 6: Transfer
learning for stance detection,’’ 2016, arXiv:1606.03784.
[14] S. Ghosh, P. Singhania, S. Singh, K. Rudra, and S. Ghosh, ‘‘Stance
detection in web and social media: A comparative study,’’ in Proc. Int.
Conf. Cross-Lang. Eval. Forum Eur. Lang. Cham, Switzerland: Springer,
2019, pp. 75–87.
[15] A. I. Al-Ghadir, A. M. Azmi, and A. Hussain, ‘‘A novel approach to
stance detection in social media tweets by fusing ranked lists and
sentiments,’’ Inf. Fusion, vol. 67, pp. 29–40, Mar. 2021.
[16] S. Somasundaran and J. Wiebe, ‘‘Recognizing stances in ideological on-
line debates,’’ in Proc. NAACL HLT Workshop Comput. Approaches
Anal. Gener. Emotion Text, 2010, pp. 116–124.
[17] A. Konjengbam, S. Ghosh, N. Kumar, and M. Singh, ‘‘Debate stance
classification using word embeddings,’’ in Proc. Int. Conf. Big Data
Anal. Knowl. Discovery. Cham, Switzerland: Springer, 2018, pp. 382–
395.
[18] A. Faulkner, ‘‘Automated classification of stance in student essays: An
approach using stance target information and the Wikipedia link-based
measure,’’ in Proc. 27th Int. Flairs Conf., May 2014.
29474 VOLUME 11, 2023
A. Altheneyan, A. Alhadlaq: Big Data ML-Based Fake News Detection Using Distributed Learning
[19] J. Weedon, W. Nuland, and A. Stamos, Information Operations and
Facebook. Menlo Park, CA, USA: Facebook, 2017.
[20] A. Vlachos and S. Riedel, ‘‘Identification and verification of simple claims
about statistical properties,’’ in Proc. Conf. Empirical Methods Natural Lang.
Process., 2015, pp. 2596–2601.
[21] S. N. Shorabeh, N. N. Samany, F. Minaei, H. K. Firozjaei, M. Homaee, and
A. D. Boloorani, ‘‘A decision model based on decision tree and particle
swarm optimization algorithms to identify optimal locations for solar power
plants construction in Iran,’’ Renew. Energy, vol. 187, pp. 56–67, 2022.
[22] E. Zotova, R. Agerri, and G. Rigau, ‘‘Semi-automatic generation of
multilingual datasets for stance detection in Twitter,’’ Expert Syst. Appl., vol.
170, May 2021, Art. no. 114547.
[23] S. Mishra, P. Shukla, and R. Agarwal, ‘‘Analyzing machine learning enabled
fake news detection techniques for diversified datasets,’’ Wireless Commun.
Mobile Comput., vol. 2022, pp. 1–18, Mar. 2022.
[24] ‘‘Apache Spark,’’ Apache Software Foundation, retrieved Jan. 17, 2018.
[25] A. Sen, M. Sinha, S. Mannarswamy, and S. Roy, ‘‘Stance classification of
multi-perspective consumer health information,’’ in Proc. ACM India Joint
Int. Conf. Data Sci. Manage. Data, Jan. 2018, pp. 273–281.
[26] S. V. Vychegzhanin and E. V. Kotelnikov, ‘‘Stance detection based on
ensembles of classifiers,’’ Program. Comput. Softw., vol. 45, no. 5,
pp. 228–240, Sep. 2019.
[27] C. Silverman, ‘‘Lies, damn lies and viral content,’’ Tow Center Digit.
Journalism, Columbia Univ., New York, NY, USA, 2015.
[28] S. Harabagiu, A. Hickl, and F. Lacatusu, ‘‘Negation, contrast and
contradiction in text processing,’’ in Proc. AAAI, vol. 6, 2006, pp. 755–762.
[29] S. Mohammad, S. Kiritchenko, P. Sobhani, X. Zhu, and C. Cherry,
‘‘SemEval-2016 task 6: Detecting stance in tweets,’’ in Proc. 10th Int.
Workshop Semantic Eval. (SemEval). San Diego, CA, USA: Association for
Computational Linguistics, 2016, pp. 31–41.
[30] B. G. Patra, D. Das, and S. Bandyopadhyay, ‘‘JU_NLP at SemEval-2016 task
6: Detecting stance in tweets using support vector machines,’’ in Proc. 10th
Int. Workshop Semantic Eval. (SemEval), 2016, pp. 440–444.
[31] H. Elfardy and M. Diab, ‘‘CU-GWU perspective at SemEval-2016 task 6:
Ideological stance detection in informal text,’’ in Proc. 10th Int. Workshop
Semantic Eval. (SemEval), 2016, pp. 434–439.
[32] I. Augenstein, T. Rocktäschel, A. Vlachos, and K. Bontcheva, ‘‘Stance
detection with bidirectional conditional encoding,’’ 2016, arXiv:1606.05464.
[33] P. Wei, W. Mao, and D. Zeng, ‘‘A target-guided neural memory model for
stance detection in Twitter,’’ in Proc. Int. Joint Conf. Neural Netw. (IJCNN),
Jul. 2018, pp. 1–8.
[34] S. Zhou, J. Lin, L. Tan, and X. Liu, ‘‘Condensed convolution neural network
by attention over self-attention for stance detection in Twitter,’’ in Proc. Int.
Joint Conf. Neural Netw. (IJCNN), Jul. 2019, pp. 1–8.
[35] M. Taulé, M. A. Martí, F. M. Rangel, P. Rosso, C. Bosco, and V. Patti,
‘‘Overview of the task on stance and gender detection in tweets on Catalan
independence at IberEval 2017,’’ in Proc. 2nd Workshop Eval. Hum. Lang.
Technol. Iberian Lang. (CEUR-WS), vol. 1881, 2017, pp. 157–177.
[36] M. Lai, A. T. Cignarella, D. I. Hernández Farías, C. Bosco, V. Patti, and
P. Rosso, ‘‘Multilingual stance detection in social media political debates,’’
Comput. Speech Lang., vol. 63, Sep. 2020, Art. no. 101075.
[37] S. Sommariva, C. Vamos, A. Mantzarlis, L. U.-L. Dào, and D. M. Tyson,
‘‘Spreading the (fake) news: Exploring health messages on social media and
the implications for health professionals using a case study,’’ Amer. J. Health
Educ., vol. 49, no. 4, pp. 246–255, Jul. 2018.
[38] B. Riedel, I. Augenstein, G. P. Spithourakis, and S. Riedel, ‘‘A simple but
tough-to-beat baseline for the Fake News Challenge stance detection task,’’
2017, arXiv:1707.03264.
[39] Q. Zhang, S. Liang, A. Lipani, Z. Ren, and E. Yilmaz, ‘‘From stances’
imbalance to their hierarchical representation and detection,’’ in Proc. World
Wide Web Conf., May 2019, pp. 2323–2332.
[40] C. Dulhanty, J. L. Deglint, I. B. Daya, and A. Wong, ‘‘Taking a stance on
fake news: Towards automatic disinformation assessment via deep
bidirectional transformer language models for stance detection,’’ 2019,
arXiv:1911.11951.
[41] B. Pouliquen, R. Steinberger, and C. Best, ‘‘Automatic detection of
quotations in multilingual news,’’ in Proc. Recent Adv. Natural Lang.
Process., 2007, pp. 487–492.
[42] M.-C. De Marneffe, A. N. Rafferty, and C. D. Manning, ‘‘Finding
contradictions in text,’’ in Proc. Assoc. Comput. Linguistics, 2008,
pp. 1039–1047.