Identifying Fake News Via Machine Learning and Web Scraping
https://doi.org/10.22214/ijraset.2023.52778
International Journal for Research in Applied Science & Engineering Technology (IJRASET)
ISSN: 2321-9653; IC Value: 45.98; SJ Impact Factor: 7.538
Volume 11 Issue V May 2023- Available at www.ijraset.com
Abstract: With digitalization, the growing use of social media has left the flow of information among social network users largely unchecked.
News reaches social networks immediately, where it is quickly read, marked with opinions (on Facebook), and shared (on Twitter and Facebook) many times without being verified as accurate or false.
In the modern world, fake news has become difficult to detect. Its impact on the judgement of ordinary people is noticeable: society has witnessed numerous events where the rampant flow of unverified news affected society as a whole. The sharing of fake news has become a major societal issue. Our project aims to help reduce e-crimes such as the use of social media to share fake content. This is achieved by training a machine to identify text-based fake content, which greatly reduces the effort of filtering out fake content while helping the public obtain verified, credible news.
Keywords: Machine Learning, Natural Language Processing, Naïve Bayes, and Fake News.
I. INTRODUCTION
Fake news detection falls under text classification [1]; in simple terms, it is the task of classifying a given news item as true or false.
The types of fake news include disinformation (intended to deceive the public), misinformation (wrong information without motive), rumours, clickbait (misleading headlines), and parodies [2].
Recent studies [2, 3] show that fake news spreads at an unprecedented rate, leading to its widespread dissemination. The effects of such news can be observed in post-election instability or in anti-vaccine groups that hindered the efforts against COVID-19. The dissemination of false information against vaccinations and the myth that vaccines are harmful are glaring examples of this. It is therefore crucial to halt the circulation of false information as soon as possible.
The project follows the previously proposed approach [4] while increasing the size and quality of the dataset and reducing the number of user inputs to the lowest possible, making it a least-input, maximum-accuracy experiment. This allowed us to reach the conclusions discussed further in this report.
©IJRASET: All Rights are Reserved | SJ Impact Factor 7.538 | ISRA Journal Impact Factor 7.894 | 5504
III. METHODOLOGY
A. Flowchart/Theory
[Figure: flowchart of the proposed system]
IV. DESIGN
As mentioned, the input is intended to be minimal, which led to adopting the Naïve Bayes algorithm; the only input needed is the keywords to be searched. The webpage through which the user interacts with the backend is developed using HTML, CSS, and Flask. The backend requires importing libraries such as Tweepy and Pandas. The most recent tweet containing the query passed by the user through the front end is extracted using Tweepy, an open-source Python module that provides access to the Twitter API. The scraped tweet is then stored in a MySQL database and filtered using natural language processing (NLP). A Naïve Bayes approach then determines the probability for the event using a dataset of true and false statements created via web scraping. The final result is shown to the user on the front end.
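The classification step described above can be sketched in pure Python. This is a minimal, illustrative multinomial Naïve Bayes with Laplace smoothing, not the project's actual implementation; the training statements shown are hypothetical stand-ins for the scraped dataset.

```python
import math
from collections import Counter

def tokenize(text):
    # Minimal NLP filtering: lowercase and keep alphabetic tokens only.
    return [w for w in text.lower().split() if w.isalpha()]

class NaiveBayes:
    """Multinomial Naive Bayes with Laplace smoothing (illustrative sketch)."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.priors = {c: labels.count(c) / len(labels) for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.counts[label].update(tokenize(text))
        self.vocab = {w for c in self.classes for w in self.counts[c]}

    def predict(self, text):
        scores = {}
        for c in self.classes:
            total = sum(self.counts[c].values())
            # Log-probabilities avoid floating-point underflow on long texts.
            score = math.log(self.priors[c])
            for w in tokenize(text):
                # Laplace (+1) smoothing handles words unseen in training.
                score += math.log((self.counts[c][w] + 1) / (total + len(self.vocab)))
            scores[c] = score
        return max(scores, key=scores.get)

# Hypothetical labelled statements standing in for the scraped dataset.
clf = NaiveBayes()
clf.fit(["the earth is round", "vaccines are safe",
         "the earth is flat", "vaccines cause autism"],
        ["true", "true", "false", "false"])
```

In the full pipeline, the text fed to `predict` would be the tweet fetched via Tweepy after NLP filtering.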
V. EXPERIMENTS
A. Experiment 1
In the first experiment, the model was trained on a dataset of 44,000 labelled statements in total. The model achieved an accuracy of 95.57%. Testing the model took 144 seconds.
B. Experiment 2
Since the accuracy was good enough, the dataset was kept the same; the major setback now was time. Thus, to reduce time consumption, parallel computing was implemented. This reduced the testing time to 89 seconds.
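The paper does not specify the parallelization scheme; one plausible sketch splits the labelled test set into chunks and scores them concurrently. Threads are used here for brevity; a process pool would better suit CPU-bound scoring.

```python
from concurrent.futures import ThreadPoolExecutor

def score_chunk(chunk, predict):
    # Count correct predictions in one slice of the test set.
    return sum(1 for text, label in chunk if predict(text) == label)

def parallel_accuracy(test_set, predict, workers=4):
    # Split the labelled test set into roughly equal chunks.
    size = max(1, len(test_set) // workers)
    chunks = [test_set[i:i + size] for i in range(0, len(test_set), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        correct = sum(pool.map(lambda c: score_chunk(c, predict), chunks))
    return correct / len(test_set)

# Toy demo: a trivial classifier that always answers "false".
demo = [("a", "false"), ("b", "true"), ("c", "false"), ("d", "false")]
acc = parallel_accuracy(demo, lambda t: "false")
```

Since each chunk is scored independently, the speedup scales with the number of workers until overhead dominates.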
C. Experiment 3
It was identified that we must now create a system that can keep pace with real-time information, so web scraping was implemented. To build a dataset that could update with real-time information, the prior dataset was discarded. This was achieved by scraping sites such as PolitiFact, a website that previously partnered with Facebook (now Meta) to create an engine similar to ours, rating the credibility of posts. PolitiFact assigns one of the values false, half-true, mostly-false, pants-fire, true, mostly-true, or barely-true to any news item as its credibility score. Given that the proposed model is a binary classifier, such target values create noise in the model. So false, half-true, mostly-false, and pants-fire were taken as false values, while true, mostly-true, and barely-true were taken as true values.
The website also offers a flip-o-meter with half-flip, full-flop, and no-flip values, none of which are useful because they are neither true nor false; we therefore excluded them from the training data.
The Beautiful Soup and Requests libraries were imported for web scraping, and the scraped data was organized into a CSV file using Pandas. The current dataset has 3521 false values and 976 true values, so to resolve the class imbalance, SMOTE (oversampling), undersampling, and SMOTEENN (oversampling + undersampling) were applied as an experiment.
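SMOTE and SMOTEENN are typically applied via the imbalanced-learn library; as a dependency-free illustration of the idea, random over- and under-sampling can be sketched as below. Note that SMOTE goes further than this sketch by synthesizing new samples through interpolation between neighbours rather than duplicating existing ones.

```python
import random

def oversample(minority, target_size, seed=0):
    # Random oversampling: duplicate minority samples until sizes match.
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(target_size - len(minority))]
    return minority + extra

def undersample(majority, target_size, seed=0):
    # Random undersampling: discard majority samples down to the target size.
    rng = random.Random(seed)
    return rng.sample(majority, target_size)

# Illustrative counts matching the dataset above: 3521 false vs 976 true.
false_items = [("false", i) for i in range(3521)]
true_items = [("true", i) for i in range(976)]
balanced_up = oversample(true_items, len(false_items)) + false_items
balanced_down = undersample(false_items, len(true_items)) + true_items
```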
The balancing techniques yielded better sensitivity, but on all other metrics the figures were better without balancing. The gain in sensitivity was not significant enough compared to the decrease in precision, accuracy, and F1 score. As the dataset grew, the accuracy rose to 79.11%, suggesting that the model's accuracy grows in parallel with the dataset over time. We also switched from a system-based source-code editor to Google Colab, which increased processing speed. The present execution time is only 15 seconds, given that the dataset is much smaller than in the previous experiments.
A confusion matrix was used to evaluate the performance of our classifier; the result is shown below.
Calculated from the confusion matrix, the values of sensitivity, specificity, and precision were 85.42%, 52.60%, and 88.34% respectively.
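These metrics follow from the standard confusion-matrix definitions. A minimal sketch, using illustrative counts rather than the paper's actual matrix:

```python
def metrics(tp, fn, fp, tn):
    # Standard binary-classification metrics from confusion-matrix counts.
    return {
        "sensitivity": tp / (tp + fn),   # recall on the positive class
        "specificity": tn / (tn + fp),   # recall on the negative class
        "precision":   tp / (tp + fp),
        "accuracy":    (tp + tn) / (tp + fn + fp + tn),
    }

# Illustrative counts only; the paper's confusion matrix is given as a figure.
m = metrics(tp=8, fn=2, fp=1, tn=9)
```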
The overall growth with each experiment performed can be seen in the graph below,
VI. DISCUSSION
The entire experiment has led us to conclude that working with a real-time dataset is most effective. A machine that can continuously update the dataset with fewer, more pertinent statements would therefore perform better. As demonstrated by the third experiment, accuracy may be subpar at first but increases over time. The ideal system would update statements every second; however, if a query is run where a past event is a crucial component, old news must also be preserved for precise output. Creating a large database therefore seems like an option, but it becomes a stalemate, as processing time grows steeply as data keeps getting scraped. We conclude that whether to keep the dataset massive or limited depends on the situation, and neither choice is clearly advantageous. Techniques like RNNs (recurrent neural networks), which extend memory so that the results of executed queries can also be used as input, could also be implemented, increasing accuracy over time. This project aimed at getting the best result with minimum input (the words of the statement), but new parameters could also be introduced, which would enable the use of SVM and Random Forest algorithms as well.
VII. CONCLUSION
This paper presents a method of detecting fake news using Naïve Bayes, aiming to reduce fake news circulation by detecting whether a news item or tweet is true or false before it spreads. The project mainly focused on developing a machine that can detect fake news with the least input, which helps preserve the user's privacy. To this end, various experiments were performed and conclusions drawn, as discussed in the paper.
VIII. ACKNOWLEDGEMENT
This project has involved our whole group's efforts. It would not have been achievable without our individual efforts and the kind assistance of our project guide, to whom we would like to express our gratitude for guiding us through this project.
REFERENCES
[1] Liu, C., Wu, X., Yu, M., Li, G., Jiang, J., Huang, W., Lu, X.: A two-stage model based on BERT for short fake news detection. In: Lecture Notes in Computer
Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). 11776 LNAI, pp. 172–183 (2019).
https://doi.org/10.1007/978-3-030-29563-9_17
[2] Zhou, X., Zafarani, R.: A survey of fake news: fundamental theories, detection methods, and opportunities. ACM Comput. Surv. (2020).
https://doi.org/10.1145/3395046
[3] Shu, K., Wang, S., Liu, H.: Beyond news contents: the role of social context for fake news detection. In: WSDM 2019—Proceedings of the 12th ACM International Conference on Web Search and Data Mining, pp. 312–320 (2019).
[4] Cusmuliuc, C.-G., Cusmuliuc, G., Iftene, A.: Identifying fake news on Twitter using Naïve Bayes, SVM and Random Forest distributed algorithms (2018).
[5] Hoens, T.R., Polikar, R., Chawla, N.V.: Learning from streaming data with concept drift and imbalance: an overview. Prog. Artif. Intell. 1, 89–101 (2012).
[6] Raza, S., Ding, C. Fake news detection based on news content and social contexts: a transformer-based approach. Int J Data Sci Anal 13, 335–362 (2022).