2024 Eai Airo
Abstract
INTRODUCTION: This research focuses on the increasing importance of social media websites as versatile platforms for
entertainment, work, communication, commerce, and accessing global news. However, it emphasizes the need to use this power
responsibly.
OBJECTIVES: The objective of the study is to evaluate the performance of artificial intelligence algorithms in detecting fake
news.
METHODS: The study compares six machine learning algorithms, supported by natural language processing techniques.
RESULTS: The study identifies four algorithms with a 99% accuracy rate in detecting fake news.
CONCLUSION: The results demonstrate the effectiveness of the proposed method in enhancing the performance of artificial
intelligence algorithms in addressing the problem of fake news detection.
Keywords: Machine Learning, Natural Language Processing, Fake News, Scikit-Learn, Python
Copyright © 2024 R. H. Al-Furaiji et al., licensed to EAI. This is an open access article distributed under the terms of the CC BY-NC-SA
4.0, which permits copying, redistributing, remixing, transformation, and building upon the material in any medium so long as the original
work is properly cited.
doi: 10.4108/airo.4153
Table 1: accuracy score paper [1]

Model                 DS1    DS2    DS3    DS4
Logistic Regression   0.97   0.91   0.91   0.87
Decision Tree         0.98   0.94   0.94   0.9

Table 2: accuracy score paper [3]

Model            K-FOLD-1   K-FOLD-2   K-FOLD-3   K-FOLD-4
Decision Trees   0.94       0.91       0.96       0.95

Paper [4] presents research on various machine learning algorithms, both individual classification algorithms and ensemble learning algorithms. The study also incorporates natural language processing techniques to identify the optimal algorithm for detecting fake news. Specifically, the paper explores the effectiveness of Logistic Regression, Decision Trees, Random Forest, SVC, AdaBoost, and XGBoost in this task.

The research paper [5] presents a solution for detecting fake news in Spanish by using traditional text feature extraction tools like TF-IDF and Stack, and a weak classifier-based ensemble learning approach. The proposed method extracts additional information about the text, such as information about publishers and topics published, and uses a variety of machine learning algorithms such as Logistic Regression, Passive Aggressive Classifier (PAC), SGD Classifier, and Ridge Classifier (RC). The results on the test and validation datasets were competitive, with an accuracy rate in IberLEF 2021 that was 8% higher than in MEX-A3T 2020. The researchers participated in the IberLEF 2021 campaign, which evaluates systems that process natural language in Spanish and other Iberian languages.

Table 3: accuracy score paper [5]

Model                                 Merge All   Stop words   Accuracy
Logistic Regression (LR)              Yes         No           0.859
SGD Classifier (SGDC)                 Yes         No           0.865
Passive Aggressive Classifier (PAC)   Yes         No           0.870
Ridge Classifier (RC)                 Yes         No           0.862

In research papers [6] and [7], the focus was on machine learning algorithms such as the Passive Aggressive classifier and Logistic Regression, which were applied as part of a set of algorithms to determine the best algorithm for detecting fake news. In the paper [6], four machine learning algorithms were used, namely the Naïve Bayes algorithm, Support Vector Machine classification, Logistic Regression, and the Passive Aggressive classifier, together with natural language processing techniques such as bag of words and TF-IDF. The practical application results showed that the Passive Aggressive classifier gave the highest accuracy rate of 99.5%. In the paper [7], the researchers sought to determine whether the idea of using artificial intelligence to solve the problem of fake news is valid by creating a website that can help users verify fake news. Three machine learning algorithms were used, namely Logistic Regression, Naïve Bayes, and Random Forest, as well as some natural language processing techniques, and the results are shown in the tables below:

Table 4: accuracy score paper [6]

Model                           Accuracy
Logistic Regression             0.98
Passive Aggressive classifier   0.99

Table 5: accuracy score paper [7]

Model                           Accuracy
Logistic Regression             0.65
Passive Aggressive classifier   0.92

In the paper [8], three models were utilized: RoBERTa, a pre-trained language model from the BERT family; Bi-LSTM, a deep learning algorithm; and the Passive Aggressive Classifier, a machine learning technique. After choosing the right textual attributes and contrasting the results obtained by the three models, it was found that RoBERTa delivered the best results with monolingual texts. Bi-LSTM, however, performed better when dealing with texts in multiple languages.

In Janicka et al. [9], a system design that can be applied to real-time news accuracy prediction is proposed. Features are extracted from the data using natural language processing, and machine learning classifiers such as Naive Bayes, Support Vector Machine (SVM), Random Forest (RF), Stochastic Gradient Descent (SGD), and Logistic Regression (LR) are trained using these features. Several criteria are used to assess each classifier's performance. Then, the best classifier is implemented as a web application using the Flask API to predict the accuracy of news in real time.
In the paper [10], the researchers addressed a fundamental question: whether it is possible to build a model that handles, with the same efficiency, both the types of data it has been trained on and those it has not. To this end, the researchers built a model based on analyzing non-linguistic features such as writing style and psychological factors using four machine learning algorithms: LinearSVC, SGD Classifier, Extra Trees Classifier, and XGBoost. This model was trained on four sets of data, and the results showed that the performance of the model depends on the textual features and characteristics that it was trained on, and that any interaction with other types of data reduces the efficiency of the model by 20%. Therefore, training the model on data sets from various domains is important.

In the paper [11], machine learning and deep learning algorithms were employed to determine the veracity of news. Two methodologies, both centered around language, were utilized in this study. The first approach involved concatenating the news title and the main news content for the experiment. The second approach focused solely on the news content, disregarding the title.

In [12] and [13], the researchers used the Logistic Regression and Naive Bayes machine learning algorithms to detect fake political news. In the paper [12], the algorithms were applied in two stages: the first stage involved detecting fake news, while the second stage verified the fake news detection. In the paper [13], a framework for detecting fake news was proposed based on feature extraction and selection techniques such as inverse document frequency and bag of words, as well as a set of classifiers including voting classifiers. The extracted features were reduced using various machine learning algorithms and a variance analysis algorithm.

In the paper [14], the researchers worked on discovering fake news on the Twitter platform. However, tweets are short texts that are not subject to any syntactic rules. Therefore, the researchers proposed using tweet features such as the number of retweets, the number of likes, and the tweet length. They trained a set of algorithms including SVM, decision tree, and neural network.

In Adedoyin et al. [15], the performance of seven algorithms, five machine learning algorithms and two deep learning algorithms, was evaluated for detecting fake news on social media platforms. Using a set of measurement tools, the best model was determined, and a web application was built for this purpose.

2. Methodology

In this research paper, we evaluate and compare the performance of six machine learning algorithms to determine the best algorithm for detecting fake news. These algorithms are Extra Tree Classifier, Logistic Regression, SGD Classifier, Passive Aggressive Classifier, Ridge Classifier, and Decision Tree Classifier, with the assistance of some natural language processing tools. The selected algorithms are provided by the scikit-learn (sklearn) library written in Python, and they are among more than 20 other algorithms available in this library for text processing and classification. Some of the selected algorithms are less well-known, as few studies discuss them, and we did not find any previous study comparing the six algorithms we chose for this research. Therefore, we took the initiative to compare these six algorithms.

In this section, we follow four stages to reach the desired results. In the first stage, we analyze the data. In the second stage, we process the data and convert it into vectors. In the third stage, we train the algorithm on the training data. Finally, in the fourth stage, we evaluate the performance of the algorithm on the test data.

A. Data Analysis

Visual and analytical inspection of the data helps to identify and verify fake news effectively. Therefore, we analyze the data set by displaying the most important words using a word cloud tool, in addition to displaying the top 20 most frequently repeated words in the data set using Count Vectorizer.

A.1 Word cloud: A visual representation of the frequency of words in a text, where the size of each word denotes its frequency.

A.2 Top 20 Words: By counting the top 20 most frequent words, Count Vectorizer can offer insight into the textual data that has been studied. This helps in understanding which words appear most frequently in the text as well as in locating relevant patterns and key terms.

Figure 1. Bar Chart of Top Words Frequency.

B. Data pre-processing

Text written in natural language is often chaotic and noisy because it contains so much unimportant information. Text pre-processing is required to prepare it for further analysis and learning. The text must be transformed into an organized and dependable format to feed the content into a
EAI Endorsed Transactions on
AI and Robotics
| Volume 3 | 2024 |
R. H. Al-Furaiji and H. Abdulkader
model. In this research, we applied the following pre-processing steps:
• remove irrelevant spaces,
• remove punctuation marks,
• remove the stop words,
• split data into training and test sets.

B.1 Remove irrelevant spaces: To ensure data cleanliness, it is crucial to remove extra spaces, newlines, tabs, and any other redundant white space from the dataset.

B.2 Remove punctuation marks: When working with machine learning and data processing tasks, it is common to remove punctuation as a preliminary step.

B.3 Remove the stop words: We avoid words that consume extra resources so that the data can be stored and processed as efficiently as possible. This can be accomplished by keeping a list of stop words, that is, words with little or no meaning in a particular language. For instance, NLTK offers a set of stop words in 16 different languages that we can use.

B.4 Split data into train and test: As a pre-processing step in machine learning, dividing the modeling dataset into training and testing samples is essential. The performance of the model can be evaluated by generating multiple training and testing samples.

C. Converting text into Vectors

Text vectorization converts textual data into a numerical format, enabling computers to understand and handle the input. In this paper, we use TF-IDF to convert text into a numerical format. To understand TF-IDF, it is first necessary to define term frequency (TF) and inverse document frequency (IDF) separately. Put simply, term frequency (TF) denotes the weight of a word within a single document, while inverse document frequency (IDF) represents the weight of a word across all documents.

The following steps are used in the TF-IDF technique to convert text into numerical representations:

• Calculating Term Frequency (TF): Based on the frequency of each word in the text, TF generates a numerical value. It shows how frequently a word appears in the text. The formula for computing TF is:

TF(t) = (number of times word t appears in the text) / (total number of words in the text)

• Calculating Inverse Document Frequency (IDF): IDF measures a word's importance across the full corpus of documents. It reflects how rarely the word appears across all of the documents. The formula for computing IDF is:

IDF(t) = log((total number of documents in the collection) / (number of documents containing the word t))

• Calculating Term Frequency-Inverse Document Frequency (TF-IDF): The TF-IDF is calculated by multiplying a word's TF value by its IDF value. It represents the term frequency weighted by how rare the word is across the full collection of documents. The formula for computing TF-IDF is:

TF-IDF(t) = TF(t) * IDF(t)

These procedures transform the text into numerical vectors that correspond to the document's word representations. A higher TF-IDF value indicates that a word is frequent in a document but rare across the collection, and is therefore distinctive and important for that document.

D. Train the Models

In this research paper, six supervised learning algorithms are applied to two manually classified datasets to evaluate their performance and to determine which of the six is the best algorithm for solving the problem of fake news.

D.1 Logistic Regression (LR): This statistical model, often called a logistic regression model, is frequently employed for classification and predictive analytics. Based on a specific dataset of independent variables, logistic regression calculates the probability of an event occurring, such as voting or not voting. Because the result is a probability, the dependent variable is limited to the range between 0 and 1.

D.2 SGD Classifier (SGDC): Stochastic gradient descent (SGD) is regarded as a straightforward and effective method for fitting linear classifiers under convex loss functions, such as the loss used by logistic regression. SGD is essentially a method for training models; it is not tied to any particular family of machine learning models. It has been used effectively to solve a wide variety of machine learning problems, including text categorization and natural language understanding.
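The pre-processing steps of section B, the TF, IDF, and TF-IDF formulas of section C, and a logistic regression model fitted by stochastic gradient descent (D.1 and D.2) can be traced end to end in a minimal sketch. It uses only the Python standard library so that each formula is visible; it is not the implementation used in the study. The toy corpus, its labels, the miniature stop-word list, the lowercasing step, the natural logarithm, and the learning rate and epoch count are all assumptions made for illustration.

```python
import math
import random
import re
import string

# Hypothetical miniature stop-word list; a real run would load a full list, e.g. from NLTK.
STOP_WORDS = {"the", "a", "is", "of", "in", "and"}


def preprocess(text):
    """B.1-B.3: strip punctuation, collapse whitespace, drop stop words."""
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\s+", " ", text).strip().lower()  # lowercasing is an added assumption
    return [w for w in text.split(" ") if w and w not in STOP_WORDS]


def tf(term, doc):
    """TF(t) = occurrences of t in the document / total words in the document."""
    return doc.count(term) / len(doc)


def idf(term, docs):
    """IDF(t) = log(total documents / documents containing t)."""
    containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / containing)


def tf_idf(term, doc, docs):
    """TF-IDF(t) = TF(t) * IDF(t)."""
    return tf(term, doc) * idf(term, docs)


def predict_proba(w, b, x):
    """Logistic regression output: probability that x belongs to class 1."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(w, b, X, y):
    """Mean logistic loss, the convex loss that SGD minimises in this sketch."""
    eps = 1e-12
    total = 0.0
    for x, label in zip(X, y):
        p = min(max(predict_proba(w, b, x), eps), 1 - eps)
        total -= label * math.log(p) + (1 - label) * math.log(1 - p)
    return total / len(X)


def train_sgd(X, y, epochs=200, lr=0.5, seed=0):
    """Fit logistic regression with one gradient step per training example."""
    rng = random.Random(seed)
    w, b = [0.0] * len(X[0]), 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            err = predict_proba(w, b, X[i]) - y[i]  # gradient of the loss w.r.t. the logit
            w = [wj - lr * err * xj for wj, xj in zip(w, X[i])]
            b -= lr * err
    return w, b


# Toy corpus with made-up labels (1 = fake, 0 = real); not data from the study.
corpus = [
    "Aliens signed   the bill,\n in secret!",
    "Miracle pill cures aging overnight",
    "The president signed the bill.",
    "The weather is mild in spring.",
]
labels = [1, 1, 0, 0]

docs = [preprocess(t) for t in corpus]
vocab = sorted({w for d in docs for w in d})
X = [[tf_idf(w, d, docs) for w in vocab] for d in docs]

w0, b0 = [0.0] * len(vocab), 0.0
w, b = train_sgd(X, labels)
print("loss before training:", log_loss(w0, b0, X, labels))
print("loss after training: ", log_loss(w, b, X, labels))
```

In practice, scikit-learn's TfidfVectorizer would replace the hand-written functions. By default it applies a smoothed IDF and L2 normalization, so its values differ numerically from the plain formulas above while playing the same role.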
D.3 Passive Aggressive Classifier (PAC): Online learning algorithms encompass various approaches, one of which is the Passive Aggressive algorithm. This algorithm remains passive when it classifies an example correctly, leaving the model unchanged, and reacts aggressively when it misclassifies an example, updating the model to correct the error. Designed for online machine learning, the Passive Aggressive algorithm learns gradually by processing consecutive examples, either individually or grouped into small mini-batches. Upon receiving a new sample, the algorithm checks whether the sample is classified correctly and updates the model aggressively if it is not. Continued training in this way improves its handling of examples that it previously misclassified.

D.4 Ridge Classifier (RC): This is a method used in machine learning to fit linear discriminant models. To avoid overfitting, it applies a form of regularization that penalizes large model coefficients. Overfitting, a prevalent problem in machine learning, occurs when a model is overly complicated and captures noise in the data rather than the underlying signal. On new data, this may result in poor generalization performance. Ridge classification tackles this issue by including a penalty term in the cost function that deters complexity. As a result, the model's ability to generalize to new data is improved. The target values are first converted into the values -1 and 1, and the problem is then handled as a regression task (multi-output regression in the case of multiple classes).

D.5 Decision Tree Classifier (DT): Decision trees (DTs) are a non-parametric supervised learning method suitable for classification and regression tasks. Their main goal is to build a model that predicts the value of the target variable by learning simple decision rules from the features of the data. A tree can be viewed as a piecewise constant approximation of the target.

3. Results and Discussion

In this study, the quality of machine learning algorithms' performance in fake news detection was evaluated using performance indicators such as accuracy, recall, F1-score, and precision. These measurements were based on the analysis of the confusion matrix. The experiments involved the application of six machine learning algorithms to two manually classified datasets. The findings indicate that the performance of these algorithms in fake news detection has reached an advanced stage. Among the algorithms tested, the Extra Tree algorithm achieved the lowest accuracy rate, 86%. The remaining algorithms reached accuracy rates of 98% to 99%, indicating their superior performance in this task.

Confusion Matrix: The confusion matrix is an effective tool for visualizing the outcomes of predictions in a classification task. It displays a table of a classifier's predicted and actual values. There are four primary sections in the confusion matrix:

• True Positive (TP): The number of positive samples that the model correctly identified as positive.
• False Positive (FP): The number of negative samples that the model misclassified as positive.
• True Negative (TN): The number of negative samples that the model correctly identified as negative.
• False Negative (FN): The number of positive samples that the model misclassified as negative.

In this research, we find that the color shades within each group of cells of the confusion matrix are very close. In our study, the matrix consists of two color shades: pink and black. The pink color represents True Positive (TP) and True Negative (TN), while the black color represents False Positive (FP) and False Negative (FN). The resulting chart shows that the models perform very well, which is confirmed when the results are examined using the other measurement tools that rely primarily on the confusion matrix outcomes.
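The accuracy, precision, recall, and F1-score used in the Results and Discussion section all follow directly from the four confusion-matrix counts TP, FP, TN, and FN. The following minimal sketch shows that derivation in plain Python. The label vectors are made-up illustrative values, not results from the study, and the choice of 1 as the positive class is an assumption.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN, FN for a binary task, given which label is 'positive'."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        else:
            if actual == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def scores_from_counts(tp, fp, tn, fn):
    """Derive the four standard metrics from the confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0  # 0.0 when nothing was predicted positive
    recall = tp / (tp + fn) if tp + fn else 0.0     # 0.0 when there are no positive samples
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels for illustration only (1 = fake, 0 = real).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

counts = confusion_counts(y_true, y_pred)
scores = scores_from_counts(*counts)
print("TP, FP, TN, FN =", counts)
print(scores)
```

With scikit-learn, the functions confusion_matrix and classification_report in sklearn.metrics provide the same quantities without hand-written counting.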
[Figure 4: bar chart of accuracy score (vertical axis, 0 to 1) for P1, P3, P5, P6, P7, DS1, and DS2; series: ET, LG, SDG, PAC, RC, DTs]

Figure 4. Accuracy Score

Figure 4 compares our results with the state of the art. P1 to P7 denote papers [1] to [7], respectively.

Conclusion

In this research paper, a system for the detection of fake news is implemented using machine learning and natural language processing. Six different machine learning algorithms were implemented, and their results were compared using several metrics. Metrics based on two different datasets showed that the algorithms achieve accuracy rates ranging from 86% to 99%. Four of these algorithms achieved very high accuracy rates, reaching up to 99%, with very small fractional differences between them. These algorithms are SGD, PAC, RC, and DTs. Following them is the LR algorithm, which achieved an accuracy rate of 98%. Finally, the ET algorithm yielded the lowest accuracy rate at 86%, which is still not a poor result. This demonstrates the significant advancements in the use of artificial intelligence and natural language processing techniques in detecting fake news.

Reference

[1] Iftikhar Ahmad, Muhammad Yousaf, Suhail Yousaf, Muhammad Ovais Ahmad, “Fake News Detection Using Machine Learning Ensemble Methods,” Hindawi, p. 11, 2020. https://doi.org/10.1155/2020/8885861
[2] Khanam, Z., Alwasel, B. N., Sirafi, H., & Rashid, M. (2021). Fake news detection using machine learning approaches. IOP Conference Series. Materials Science and Engineering, 1099(1), 012040. https://doi.org/10.1088/1757-899x/1099/1/012040
[3] Albahr, A., & Albahar, M. (2020). An Empirical Comparison of Fake News Detection using Different Machine Learning Algorithms. International Journal of Advanced Computer Science and Applications: IJACSA, 11(9). https://doi.org/10.14569/ijacsa.2020.0110917
[4] Singh, G., & Selva, K. (2023). Detection of fake news using NLP and various single and ensemble learning classifiers. https://doi.org/10.36227/techrxiv.21856671.v2
[5] HAHA at FakeDeS 2021: A Fake News Detection Method Based on TF-IDF and Ensemble Machine Learning. (n.d.).
[6] Thirupathi, L., Gangula, R., Ravikanti, S., Sowmya, J., & Shruthi, S. K. (2021). False news recognition using machine learning. Journal of Physics. Conference Series, 2089(1), 012049. https://doi.org/10.1088/1742-6596/2089/1/012049
[7] Sharma, U., Saran, S. and Patil, S.M. (2021) ‘Fake News Detection using Machine Learning Algorithms’, International Journal of Engineering Research & Technology (IJERT), pp. 509–518.
[8] Arif, M. et al. (2022) ‘CIC at CheckThat! 2022: Multi-class and Cross-lingual Fake News Detection’, CLEF 2022: Conference and Labs of the Evaluation Forum [Preprint]. Available at: https://ipn.elsevierpure.com/en/publications/cic-at-checkthat-2022-multi-class-and-cross-lingual-fake-news-det (Accessed: 2022).
[9] Janicka, M., Pszona, M., & Wawer, A. (2019). Cross-domain failures of fake news detection. Computación y Sistemas, 23(3). https://doi.org/10.13053/cys-23-3-3281
[10] Arif, M. et al. (2022) ‘CIC at CheckThat! 2022: Multi-class and Cross-lingual Fake News Detection’, CLEF 2022: Conference and Labs of the Evaluation Forum [Preprint]. Available at: https://ipn.elsevierpure.com/en/publications/cic-at-checkthat-2022-multi-class-and-cross-lingual-fake-news-det (Accessed: 2022).
[11] Team Sigmoid at CheckThat!2021 Task 3a: Multiclass fake news detection with Machine Learning. (n.d.).
[12] Sudhakar, M., & Kaliyamurthie, K. P. (2022). Effective prediction of fake news using two machine learning algorithms. Measurement. Sensors, 24(100495), 100495. https://doi.org/10.1016/j.measen.2022.100495
[13] Gaikwad, T., Rajale, B., Bhosale, P., Vedpathak, S., & Adagale, M. S. S. (2022). Detection of fake news using Machine Learning Algorithms. International Journal for Research in Applied Science and Engineering Technology, 10(12), 1506–1510. https://doi.org/10.22214/ijraset.2022.48172
[14] Diaa SALAMA ABD ELMINAAM, Yomna M.I. HASSAN, Radwa MOSTAFA, Abd Elrahman TOLBA, Mariam KHALED, and John GERGES, “Fake News Detector Using Machine Learning Algorithms,” Proceedings of the 36th International Business Information Management Association (IBIMA), ISBN: 978-0-9998551-5-7, 4-5 November 2020, Granada, Spain.
[15] Adedoyin, F., & Mariyappan, B. (2022). Fake news detection using machine learning algorithms and recurrent neural networks. In SAGE Advance. https://doi.org/10.31124/advance.20751379.v1