A Transformer-Based Approach For Fake News Detection Using Time Series Analysis
Abstract—Fake news is a growing problem in the digital age, spreading misinformation and affecting public opinion. Existing fake news detection relies on style analysis or on analysis of the news generator's behavior; the former fails if the content is generated from an existing news corpus, while the latter requires a very specific data set. In this research, we address fake news detection by employing deep learning-based time series analysis (TSA). We propose a TSA method that gauges the authenticity of a news item based on previously available news content of a similar genre. We employed pre-trained models, including GloVe and BERT, for encoding and a transformer-based sequence-to-sequence (Seq2Seq) model for TSA. The results demonstrate 98% accuracy for the pre-trained models GloVe and BERT, compared to traditional encoding approaches whose accuracy lies between 77% and 93%. Our study also compares the effectiveness of various deep learning methods, including Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), LSTM with attention, GRU with attention, and Transformers with 8 and 16 multi-heads. The results show that Transformers with 8 and 16 multi-heads achieve 97% and 98% accuracy respectively, compared to LSTM (87%) and GRU (88%). Our work is useful for future research in TSA-based fake news detection and proposes the use of GloVe and BERT-based encoding together with a multi-head transformer architecture.

Index Terms—NLP, Deep Learning, Time Series Analysis, Fake News, Sequence2Sequence, Attention, Word Embedding.

I. INTRODUCTION

Fake news is counterfeit content presented as real information. There are numerous channels through which news proliferates to the public, among which social media is a major source of fake news. Social media and other e-platforms make news publishing simpler; at the same time, it becomes extremely difficult to verify the authenticity of the news. There might be different motives to spread fake news, including financial benefit, shaping certain ideologies or public opinion, serving as a tool of propaganda, and manipulating the outcomes of elections or other events [1]. The impact of fake news has far-reaching implications for various aspects of society such as politics, democracies, journalism, public opinion, and even personal relationships [2]. Therefore, detecting fake news is of prime importance, since fake news can have serious consequences, particularly in the realm of politics and public discourse. Several techniques exist to detect fake news, including content analysis and sentiment and style analysis. Several surveys have been written in this field from various perspectives; key surveys such as [2], [3], [4] and [5] provide an overview of the techniques used for fake news detection. The study [5] gives a comprehensive survey covering the definition of fake news, the challenges in detecting it, and the various detection methods. Besides these techniques, an emerging direction in fake news detection pertains to time series analysis: judging how authentic a news item could be based on previous news items in the same genre. Most fake news algorithms are based on content or style analysis (CSA) or on the content writer's behavior analysis (CWB). Collecting datasets for CSA is difficult; moreover, fake news is mostly generated from existing news items, which makes it hard to detect using content analysis. Similarly, collecting data for CWB is very challenging, since the content writer's behavior might not be readily available. One of the emerging and unexplored fields is fake news detection using time series analysis (TSA). TSA is an important tool that uses trends in previously available data to predict and forecast future values.
To determine the authenticity of N_T based on the existing relevant time series B_T, transformers are used. Each λ_i is passed to the input side of the i-th encoder, while V_T is passed to the decoder side of the corresponding Seq2Seq. Each Seq2Seq generates a relevance score, and these scores are bagged to obtain the overall authenticity score. We have employed a GRU-based network to ensemble the outputs of all decoders; the i-th Seq2Seq feeds its output to the i-th cell of the GRU network.

III. METHODOLOGY AND PROPOSED TECHNIQUE

As discussed in Section II, the news corpus undergoes several steps before a time series can be built. The previous section gave a high-level overview of each step; this section provides further details. First, each news item is tokenized and stop words are removed from the resulting tokens, followed by lemmatization, stemming, and named entity recognition (NER). Then word embedding is applied to encode the text into numerical vectors so that it can be fed into any deep learning algorithm [9]. In the last step, LSTM-, attention-, and transformer-based techniques are presented for time series analysis. We also discuss how a news item's authenticity is determined based on the existing time series of the same topic.
RNNs are variants of neural networks that can handle sequence data. Vanilla neural networks use only the current input to generate the output, whereas RNNs use the current input plus the states of previous outputs; besides computing the output, they also compute the next state [10], [11]. Text traditionally requires time sequences of very long length, which means that RNNs need a very long chain of cells, and this leads to vanishing gradients [12]. A variant of RNN known as LSTM has been proposed in the literature, which uses special gates to control the flow of information through the RNN cells [13], [14]. A standard LSTM uses input, output, and forget gates with four states, namely forget F_k, store S_k, update U_k, and output O_k, as shown in Fig. 2, where C_T and C_{T−1} are the next and previous cell states, H_T and H_{T−1} are the next and previous cell outputs, and X_T is the input.

It can be seen in Fig. 2 that each control gate has a sigmoid σ followed by a point-wise multiplication unit. The value of σ determines how much information flows through the gate, ranging from zero (no information flows) to one (all information flows). In the first stage, the forget gate f_k controls how much information is removed, as shown in equation 1.

f_k = σ(W_F · [H_{T−1}, X_T] + B_F)    (1)

where B_F is the bias and W_F the weight of the forget gate, and H_{T−1} is the previous cell output. The store block of the cell decides which information will be stored:

s̃_k = σ(W_{s̃} · [H_{T−1}, X_T] + B_{s̃})    (2)

s_k = tanh(W_S · [H_{T−1}, X_T] + B_S)    (3)

The next cell state C_T combines the product of f_k from equation 1 with the previous cell state C_{T−1} and the product s̃_k × s_k from equations 2 and 3:

C_T = f_k ∗ C_{T−1} + s̃_k ∗ s_k    (4)
To create the output H_T, the cell state is filtered using tanh. The output gate o_k controls the information that will go to the output, as shown in equation 5, and equation 6 shows how the cell creates the output H_T, which is the output gate o_k times the stored information s_k filtered through tanh.

o_k = σ(W_O · [H_{T−1}, X_T] + B_O)    (5)

H_T = o_k ∗ tanh(s_k)    (6)
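For concreteness, one step of the gating logic in equations 1 to 6 can be written out directly. The NumPy sketch below follows the equations as stated here (note that many standard LSTM implementations apply tanh to the cell state C_T rather than to s_k when forming H_T); the weight shapes and the helper name lstm_step are illustrative assumptions, not the paper's implementation.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_F, W_Stilde, W_S, W_O, B_F, B_Stilde, B_S, B_O):
    """One LSTM step following equations (1) to (6) of the text."""
    z = np.concatenate([h_prev, x_t])              # [H_{T-1}, X_T]
    f_k = sigmoid(W_F @ z + B_F)                   # eq. (1): forget gate
    s_tilde = sigmoid(W_Stilde @ z + B_Stilde)     # eq. (2): store gate
    s_k = np.tanh(W_S @ z + B_S)                   # eq. (3): candidate values
    c_t = f_k * c_prev + s_tilde * s_k             # eq. (4): next cell state
    o_k = sigmoid(W_O @ z + B_O)                   # eq. (5): output gate
    h_t = o_k * np.tanh(s_k)                       # eq. (6) as written in the text
    return h_t, c_t

# Tiny usage example with random weights (input size 8, hidden size 4).
d_in, d_h = 8, 4
rng = np.random.default_rng(0)
mk_W = lambda: 0.1 * rng.standard_normal((d_h, d_h + d_in))
b = np.zeros(d_h)
h1, c1 = lstm_step(rng.standard_normal(d_in), np.zeros(d_h), np.zeros(d_h),
                   mk_W(), mk_W(), mk_W(), mk_W(), b, b, b, b)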
Another variation of RNN is the GRU (Gated Recurrent Unit), a simplified variant of LSTM [15]. The GRU removes the cell state and uses only a hidden state to transfer information, and it achieves performance similar to LSTM with a much simpler architecture [16].

1) Attention and transformers: Seq2Seq is a well-known model employing LSTM or GRU, used for machine translation, image captioning, and a variety of other tasks [17]. It consists of two parts: an encoder, which takes the input {X_1, X_2, ..., X_N} and produces the context vector C_T and the initial decoder state S_0, and a decoder, to which both C_T and S_0 are passed and which generates an output sequence {Y_1, Y_2, ..., Y_M}, where N and M are the lengths of the sequences, as shown in Fig. 3. Each decoder state S_T needs the previous input Y_{T−1}, the previous hidden state H_{T−1}, and the context vector C.

S_T = F(Y_{T−1}, H_{T−1}, C)    (7)

However, Seq2Seq suffers from a drawback: it encodes the input into a single context vector, which is not suitable for longer sequences. Bahdanau et al. showed that concentrating on certain parts of the input sequence while generating the output at a particular point in time increases the accuracy of the system; this is commonly known as the attention mechanism [18]. Initially, the encoder hidden states h_i and the initial decoder state S_0 are passed through a multilayer perceptron F_ATT to compute so-called alignment scores (AS) γ_{i,T} at time T, as shown in equation 8. The AS are passed through a softmax layer to normalize them, resulting in attention weights (AW), as shown in equation 9 and in Fig. 3.

γ_{i,T} = F_ATT(S_0, H_i),  i ∈ N    (8)

0 < a_{i,T} < 1,  Σ_i a_{i,T} = 1    (9)
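A minimal NumPy sketch of this scoring step is shown below; the single-layer additive form of F_ATT and the layer sizes are our own illustrative choices, not the paper's exact configuration.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_weights(encoder_states, s0, W_a, v_a):
    """Additive (Bahdanau-style) attention: alignment scores (eq. 8),
    then softmax normalization into weights that sum to one (eq. 9)."""
    scores = np.array([v_a @ np.tanh(W_a @ np.concatenate([s0, h_i]))
                       for h_i in encoder_states])     # gamma_{i,T}
    return softmax(scores)                             # attention weights a_{i,T}

# Example: 5 encoder states of size 6 and a decoder state of size 6.
rng = np.random.default_rng(1)
H = rng.standard_normal((5, 6))
s0 = rng.standard_normal(6)
a = attention_weights(H, s0, rng.standard_normal((8, 12)), rng.standard_normal(8))
context = a @ H   # attention-weighted sum of encoder states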
The concept of an attention mechanism in neural networks was first introduced by Bahdanau et al. [18]. Since then, the attention mechanism has been adapted to various applications such as image captioning, question answering, and others [19]–[24]. The generalized encoder-decoder architecture of attention models was later proposed by Vaswani et al. [25]. When referring to attention models in language processing, the neural machine translation model by Bahdanau et al. is often cited [18]. The transformer model, introduced by Vaswani et al., consists of an encoder-decoder architecture that is highly parallelizable thanks to its multi-head attention mechanisms and point-wise fully-connected layers, and it is also capable of capturing long-term dependencies thanks to attention and positional encoding. To apply the transformer-based Seq2Seq model to our data, the decoder receives the encoded news item whose authenticity is to be gauged, while the encoder is fed with the different λ_i taken from bucket b_m, as shown at the end of Fig. 1. The encoder side of the Seq2Seq learns the time series relationship and captures it in the context C_T. To summarize, the overall methodology consists of the steps listed below (a minimal architectural sketch of the scoring model follows the list).
Fig. 4: Accuracy plot of different datasets w.r.t deep learning techniques
TABLE I: Comparison of different encoding schemes with deep learning pipelines using the Dawn dataset

Word Embedding /   LSTM        GRU         LSTM+Attention  GRU+Attention  Transformer 8 MHD  Transformer 16 MHD
Encoding           F1    Acc   F1    Acc   F1    Acc       F1    Acc      F1    Acc          F1    Acc
TF-IDF             0.67  0.66  0.68  0.68  0.75  0.75      0.75  0.75     0.78  0.77         0.78  0.77
UniGram+BiGram     0.71  0.72  0.73  0.72  0.78  0.77      0.78  0.77     0.78  0.79         0.79  0.79
BOWs               0.75  0.75  0.76  0.76  0.82  0.83      0.82  0.83     0.85  0.84         0.83  0.84
Word2Vec           0.81  0.81  0.81  0.80  0.85  0.86      0.85  0.86     0.92  0.91         0.93  0.93
LIWC               0.82  0.83  0.81  0.81  0.85  0.85      0.85  0.85     0.91  0.92         0.92  0.92
Doc2Vec            0.82  0.82  0.83  0.82  0.86  0.86      0.86  0.86     0.92  0.91         0.92  0.93
GloVe              0.87  0.87  0.88  0.88  0.93  0.94      0.93  0.94     0.96  0.97         0.98  0.98
BERT               0.87  0.87  0.87  0.88  0.94  0.94      0.94  0.94     0.96  0.97         0.99  0.98
■ Preprocessing news items.
■ Dividing news items into different topics using LDA.
■ Summarizing news items and placing them into the relevant topic buckets.
■ Encoding news items using word embedding techniques (BERT and GloVe) and flattening them into vectors.
■ Training an attention-based Seq2Seq model to predict the authenticity score from the items of each bucket.
■ Preprocessing the news item whose authenticity is to be determined, determining its topic, and passing it to the decoder side, with the news items from the relevant bucket fed to the encoder side.
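The following Keras sketch illustrates the scoring model at the core of these steps: a small transformer-style encoder over the bucket items and a decoder over the candidate item, joined by cross-attention and a sigmoid authenticity head. The layer sizes, number of blocks, and the scoring head are our own assumptions, and for brevity the sketch shows a single scorer rather than the full per-bucket ensemble combined through a GRU (positional encoding and masking are also omitted).

from tensorflow.keras import layers, Model

d_model, num_heads, seq_len = 128, 8, 64

def block(q, kv):
    # One attention + feed-forward sub-block with residual connections.
    x = layers.LayerNormalization()(q + layers.MultiHeadAttention(num_heads, d_model // num_heads)(q, kv))
    return layers.LayerNormalization()(x + layers.Dense(d_model, activation="relu")(x))

bucket_items = layers.Input(shape=(seq_len, d_model))   # encoder side: embedded items lambda_i from a topic bucket
candidate    = layers.Input(shape=(seq_len, d_model))   # decoder side: embedded news item to be verified

enc = block(bucket_items, bucket_items)                  # self-attention over the time series of bucket items
dec = block(candidate, candidate)                        # self-attention over the candidate item
dec = block(dec, enc)                                    # cross-attention: candidate attends to the series
score = layers.Dense(1, activation="sigmoid")(layers.GlobalAveragePooling1D()(dec))  # authenticity score

model = Model([bucket_items, candidate], score)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])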
IV. RESULTS AND PERFORMANCE EVALUATION

This section presents the results of the proposed algorithm. Subsection IV-A describes the datasets, subsection IV-B elaborates on the experimental setup, and subsections IV-C and IV-D compare different word embedding and deep learning techniques.

A. Data Set

This research work uses several datasets, including LIAR, PHEME, NELA-GT, and CREED. We also collected a dataset using a web scraper from the website of the popular Pakistani newspaper The Dawn. The datasets are described below (an illustrative scraper sketch follows the list):

1) Dawn News is collected using a Python-based web scraper built with the BeautifulSoup library from the Pakistani newspaper The Dawn. The scraper loops over a range of days and saves the different news sections as CSV files. The saved data contains the body, headings, dates, and links of the given news items.
2) LIAR is a US-based dataset containing statements made by different politicians [26]. It has over 12.8K labeled statements.
3) PHEME is a Twitter-based dataset with rumor and non-rumor labels [27].
4) NELA-GT is a dataset collected from news articles from Kaggle, Factmata, etc. [28]. It has labels such as real or fake.
5) CREED is a dataset built from mixed sources, including Wikipedia, the WHO, etc. [29]. It has 15K labeled items with True and False labels.
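The sketch below shows a scraper in the spirit of the Dawn collection step; the URL pattern, HTML tags, and saved fields are assumptions for illustration, not the exact selectors used to build the dataset.

import csv
import requests
from bs4 import BeautifulSoup

def scrape_section(url, out_csv):
    # Fetch one section page and extract headings and article links.
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    rows = []
    for article in soup.find_all("article"):
        heading = article.find("h2")
        link = article.find("a", href=True)
        if heading and link:
            rows.append({"heading": heading.get_text(strip=True), "link": link["href"]})
    # Save the scraped section as a CSV file.
    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["heading", "link"])
        writer.writeheader()
        writer.writerows(rows)

# e.g. scrape_section("https://www.dawn.com/business", "dawn_business.csv")  # hypothetical section URL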
B. Experimental Setup

We conducted our experiments on an NVIDIA RTX 3070 GPU with 128 GB of RAM and a 12th-generation Intel Core i7 processor. The algorithm is implemented in Python using NLTK, Keras, spaCy, TensorFlow, and other related libraries. Each dataset is split into 80% for training and 20% for testing.
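As a small illustration of this setup, the split and the two reported metrics can be obtained with standard scikit-learn utilities; the placeholder arrays below merely stand in for the encoded news vectors and labels.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

# Placeholder features and labels standing in for the encoded news items.
rng = np.random.default_rng(0)
X, y = rng.standard_normal((1000, 128)), rng.integers(0, 2, size=1000)

# 80/20 train/test split as used in the experiments.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# After fitting any of the compared models, the reported scores would be:
# acc = accuracy_score(y_test, y_pred)
# f1  = f1_score(y_test, y_pred)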
Figure: Comparison of training, testing and validation datasets (BERT + Transformer 8 MHD, BERT + Transformer 16 MHD, GloVe + Transformer 8 MHD, GloVe + Transformer 16 MHD).
TABLE II: Comparison of different deep learning techniques and data sets using GloVe Encoding
C. Comparison of word embedding techniques

This subsection compares the performance of different encoding techniques, including TF-IDF, unigram and bigram features, BOWs, Word2Vec, LIWC, Doc2Vec, GloVe, and BERT. For the experiment, we evaluated each encoding with several deep learning models: LSTM, GRU, LSTM with attention, GRU with attention, and transformers with 8 and 16 heads, as shown in Table I. The results indicate that encoding techniques that extract features directly from the text, such as TF-IDF, N-grams, and BOWs, did not perform well. Techniques based on feature encodings or word embeddings, such as Word2Vec, LIWC, and Doc2Vec, performed better than TF-IDF, N-grams, and BOWs; these word embedding techniques achieved an F1 score of 0.93 and accuracies of 0.93 and 0.92, respectively. Their better performance can be attributed to their ability to capture the semantic meaning of words through distributed representations. The pre-trained models GloVe and BERT showed the best performance compared to the word embedding and primitive encoding techniques: their use resulted in an F1 score of 0.98 and accuracies of 0.98 and 0.97, respectively. These results highlight the effectiveness of pre-trained models in capturing the contextual meaning of words, which contributes to the high performance of the deep learning models for time series prediction. It is important to note that the performance of the different encoding techniques varies across the deep learning models used in the comparison. For example, TF-IDF improved when used with transformers with 8 and 16 heads, achieving an F1 score of 0.78 and accuracies of 0.77 and 0.78, respectively, whereas BOWs was almost consistent across the different models, with an F1 score ranging from 0.83 to 0.85 and an accuracy ranging from 0.84 to 0.86. We used the Dawn dataset for this experiment; in the next experiment, we vary the datasets and evaluate the deep learning techniques while keeping GloVe as our word embedding technique.
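As an illustration of how these encodings differ in practice, the sketch below produces TF-IDF, averaged GloVe, and BERT representations for the same texts; the specific checkpoints ("glove-wiki-gigaword-100", "bert-base-uncased") are our assumptions, since the paper does not name the exact pre-trained weights used.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
import gensim.downloader as api
from transformers import AutoTokenizer, TFAutoModel

texts = ["example news item one", "another example news item"]

# Feature-based encoding: TF-IDF vectors.
tfidf_vecs = TfidfVectorizer().fit_transform(texts).toarray()

# Pre-trained word vectors: average the GloVe embedding of each token.
glove = api.load("glove-wiki-gigaword-100")
glove_vecs = np.array([np.mean([glove[w] for w in t.split() if w in glove], axis=0)
                       for t in texts])

# Contextual encoding: mean-pooled BERT hidden states.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = TFAutoModel.from_pretrained("bert-base-uncased")
bert_out = bert(tok(texts, padding=True, return_tensors="tf"))
bert_vecs = bert_out.last_hidden_state.numpy().mean(axis=1)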
D. Comparison of deep learning for time series analysis

In the second test, we considered the different deep learning techniques and found that RNN algorithms such as LSTM and GRU without attention give the lowest performance. If attention is added to LSTM and GRU, the performance improves compared to the basic RNN algorithms, as shown in Table II. LSTM and GRU give comparable performance, and they remain similar when attention is added to them. The transformers with 8 and 16 multi-heads (MHD) give the best performance. For this test, we selected GloVe as the word embedding technique, since it outperforms all other encoding methods. It can be noticed that transformers with 8 and 16 MHD attain 0.97 and 0.98 accuracy, respectively, over the Dawn dataset for both BERT and GloVe. Fig. 4 shows how the accuracy varies over the course of training for the different deep learning techniques. It can be observed that the Dawn dataset's accuracy becomes almost constant after 80 epochs, while for the LIAR dataset this equilibrium is reached at almost 90 epochs. Fig. 5 shows that our model does not overfit for any of the selected embedding or deep learning techniques.
V. CONCLUSIONS

This research evaluated the performance of different text vectorization techniques and deep learning methods for fake news detection using time series analysis. The findings suggest that word embedding techniques and pre-trained models such as GloVe and BERT outperform primitive encoding techniques. The results also indicate that adding attention mechanisms to RNNs such as LSTM and GRU improves their performance, and that transformers with 8 or 16 multi-heads remain the best-performing deep learning technique. These results offer valuable insights for future research in the field of fake news detection and underline the importance of carefully choosing the encoding technique and deep learning model for the problem and dataset at hand. The study compared various encoding techniques and deep learning models on benchmark datasets, and the results demonstrate the importance of time series-based deep learning methods for the effective detection of fake news.
REFERENCES

[1] H. N. Chaudhry, Y. Javed, F. Kulsoom, Z. Mehmood, Z. I. Khan, U. Shoaib, and S. H. Janjua, "Sentiment analysis of before and after elections: Twitter data of US election 2020," Electronics, vol. 10, no. 17, p. 2082, 2021.
[2] A. Bondielli and F. Marcelloni, "A survey on fake news and rumour detection techniques," Information Sciences, vol. 497, pp. 38–55, 2019.
[3] S. Mishra, P. Shukla, and R. Agarwal, "Analyzing machine learning enabled fake news detection techniques for diversified datasets," Wireless Communications and Mobile Computing, vol. 2022, pp. 1–18, 2022.
[4] L. Hu, S. Wei, Z. Zhao, and B. Wu, "Deep learning for fake news detection: A comprehensive survey," AI Open, 2022.
[5] X. Zhou and R. Zafarani, "A survey of fake news: Fundamental theories, detection methods, and opportunities," ACM Computing Surveys, 2020.
[6] S. Narejo and E. Pasero, "Time series forecasting for outdoor temperature using nonlinear autoregressive neural network models," Journal of Theoretical & Applied Information Technology, vol. 94, no. 2, 2016.
[7] S. Narejo, E. Pasero, and F. Kulsoom, "EEG based eye state classification using deep belief network and stacked autoencoder," International Journal of Electrical and Computer Engineering (IJECE), vol. 6, no. 6, pp. 3131–3141, 2016.
[8] X. Chen, K. Liu, and J. Yang, "Improved LSTM-based stock price prediction by feature extraction," Expert Systems with Applications, vol. 78, pp. 20–29, 2017.
[9] Y. Bablani, S. Uqaili, S. Narejo, and H. Zahra, "Survey on text to text machine translation."
[10] L. R. Medsker and L. Jain, "Recurrent neural networks," Design and Applications, vol. 5, pp. 64–67, 2001.
[11] F. Kulsoom, S. Narejo, Z. Mehmood, H. N. Chaudhry, A. Butt, and A. K. Bashir, "A review of machine learning-based human activity recognition for diverse applications," Neural Computing and Applications, pp. 1–36, 2022.
[12] S. Narejo and E. Pasero, "A hybrid approach for time series forecasting using deep learning and nonlinear autoregressive neural network," in INTELLI 2016: The Fifth International Conference on Intelligent Systems and Applications (includes InManEnt 2016), 2016.
[13] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[14] R. C. Staudemeyer and E. R. Morris, "Understanding LSTM: a tutorial into long short-term memory recurrent neural networks," arXiv preprint arXiv:1909.09586, 2019.
[15] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," arXiv preprint arXiv:1406.1078, 2014.
[16] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," arXiv preprint arXiv:1412.3555, 2014.
[17] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," Advances in Neural Information Processing Systems, vol. 27, 2014.
[18] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473, 2014.
[19] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio, "Show, attend and tell: Neural image caption generation with visual attention," in International Conference on Machine Learning. PMLR, 2015, pp. 2048–2057.
[20] H. Xu and K. Saenko, "Ask, attend and answer: Exploring question-guided spatial attention for visual question answering," in European Conference on Computer Vision. Springer, 2016, pp. 451–466.
[21] V. Kazemi and A. Elqursh, "Show, ask, attend, and answer: A strong baseline for visual question answering," arXiv preprint arXiv:1704.03162, 2017.
[22] H. Li, P. Wang, C. Shen, and G. Zhang, "Show, attend and read: A simple and strong baseline for irregular text recognition," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019, pp. 8610–8617.
[23] H. Mei, M. Bansal, and M. R. Walter, "Listen, attend, and walk: Neural mapping of navigational instructions to action sequences," in Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[24] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, "Listen, attend and spell," arXiv preprint arXiv:1508.01211, 2015.
[25] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," Advances in Neural Information Processing Systems, vol. 30, 2017.
[26] W. Y. Wang, V. García, and A. Narayanan, "Liar: A benchmark for fake news detection," arXiv preprint arXiv:1705.00648, 2017.
[27] A. Zubiaga, R. Procter, M. Liakata, and K. Bontcheva, "Pheme: A corpus for rumour evaluation," in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, 2014, pp. 539–548.
[28] Y. Kai-Chieh, T. Heng, C.-H. Tam, and B. Zhang, "Nela-gt: A benchmark for automatic fact checking and veracity prediction," arXiv preprint arXiv:2002.11906, 2020.
[29] A. Kavaljoglou, F. Martins, J. Almeida, B. Gledson, and G. Weikum, "Creed: A resource for fact-checking news articles on the web," in International Conference on Web Intelligence. Springer, 2017, pp. 647–655.