Social Media Amharic Fake News Detection Using NLP Techniques With SVM Algorithm
Chapter 1 Introduction
Background of the study
The rapid advancement of technology has dramatically changed how people communicate, moving much
of it to the digital world through the internet. The internet is a global
computer network that provides a variety of information and communication facilities and consists of
interconnected networks using standardized communication protocols. Social media networks are
communication facilities built on this internet infrastructure. In other words, social media is
computer-based technology that facilitates the sharing of ideas, thoughts, and information through the
building of virtual networks and communities. Users engage with social media on a computer, tablet, or
smartphone through web-based software or web applications, often using it for messaging.
Nowadays social media plays a major role as a source of information all over the world. Social
networks are dedicated websites or other applications that enable users to communicate with each
other by posting information, comments, messages, images, etc. The types of social media
used to exchange information include social networks, bookmarking sites,
social news, media sharing websites, microblogging, blog comments and forums, social review
sites, community blogs, and sharing economy networks.
Social networks: A social networking site is a social media site that allows you to connect with people
who have similar interests and backgrounds. Facebook, Twitter, and Instagram are three of the most
popular examples of a social network website. Most social network sites let users share thoughts,
upload photos and videos, and participate in groups of interest.
Bookmarking sites (Pinterest, Flipboard, Digg) allow users to save and organize links to any number
of online resources and websites. A great feature of these sites is the ability for users to “tag”
links, which makes them easier to search and to share with followers. StumbleUpon is
a popular example of a bookmarking site.
A social news site allows its users to post news links and other items to external articles. Users then
proceed to vote on said items, and the items with the highest number of votes are most prominently
displayed. A good example of a social news site is Reddit.
Media sharing websites allow users to share different types of media, with the two main ones being
image sharing and video hosting sites. Most of these sites also offer social features, like the ability to
create profiles and the option of commenting on the uploaded images or videos. These platforms
mostly encourage user-generated content, where anyone can create, curate, and share work
that expresses who they are or sparks conversation. YouTube remains the most well-known media
sharing site in the world.
Microblogging: These are just what they sound like: sites that allow users to submit short
written entries, which can include links to product and service sites, as well as links to other social
media sites. These are then posted on the ‘walls’ of everyone who has subscribed to that user’s
account. The most commonly used microblogging website is Twitter.
Blog comments and forums: An online forum is a site that lets users engage in conversations by
posting and responding to community messages. A blog comment site is similar but
a little more focused: the comments usually center on the specific subject of the attached
blog. Google has a popular blogging site aptly titled Blogger. However, there is a seemingly
endless number of blogging sites, particularly because so many of them are niche-based, unlike the
universal appeal of general social media sites.
Social Review Sites (such as TripAdvisor, Yelp, Foursquare): Review sites like TripAdvisor and
Foursquare show reviews from community members for all sorts of locations and experiences. This
keeps people informed and helps them plan and decide better, for example when
choosing a restaurant.
Community Blogs (like Medium, Tumblr): Sometimes all you want to do is share one message,
and not everyone on the internet wants to invest in running and maintaining a blog on a
self-hosted website. Shared blogging platforms like Medium give people a space to express
their thoughts and voice.
Sharing Economy Networks (such as Airbnb, Pantheon, Kickstarter): While it might not occur to you
directly, websites like Airbnb are not just for finding holiday rentals or activities. These sharing
economy networks bring people who have something to share together with the people who need
it.
The descriptions above are taken from https://seopressor.com/social-media-marketing/types-of-social-media/
Currently, social networks such as Facebook and Twitter have become the most popular main sources
of news feeds.
As an increasing amount of our lives is spent interacting online through social media platforms, more
and more people tend to seek out and consume news from social media rather than traditional news
organizations. The reasons for this change in consumption behaviors are inherent in the nature of
these social media platforms: (i) it is often more timely and less expensive to consume news on social
media compared with traditional news media, such as newspapers or television; and (ii) it is easier to
further share, comment
1. Web space
The website should provide the users free web space to upload content.
2. Web address
The users are given a unique web address that becomes their web identity. They can post and share
all their content on this web address.
3. Build profiles
Users are asked to enter personal details such as name, address, date of birth, school/college education,
professional details, etc. The site then mines the personal data to connect individuals.
6. Enable conversations
Members are given the right to comment on posts made by friends and relatives. These conversations
foster social connection.
Malicious
Social media has now become a main source of information, and even mainstream media have shifted
towards it. Its low cost, ease of use, rapid dissemination of information, and fast
reaction to events make social media an ideal place for individuals to express emotions, feelings,
claims, and thoughts. Its major drawback is the absence of a service that lets users check
the truthfulness of posted information. This lack of verification has given rise to the
problem of fake information, or fake news. The extensive spread of fake news has the
potential for extremely negative impacts on individuals and society, reducing government
credibility and even endangering national security.
General objective:
To detect Amharic fake news on social media using NLP techniques and support vector
machines.
Specific objectives:
⮚ To collect and prepare manually labeled data of Amharic news articles.
⮚ To explore relevant textual features of Amharic news articles.
⮚ To extract prominent textual features of Amharic news articles.
⮚ To model Amharic fake news detection as a binary classification task.
⮚ To exploit the inherently two-class nature of a support vector machine for Amharic fake
news detection.
⮚ To train and evaluate the model.
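The objectives above can be sketched in a few lines. The following is a minimal illustration, not the
study's implementation: it assumes a simple whitespace and punctuation tokenizer, bag-of-words counts
as features, and a linear support vector machine trained by sub-gradient descent on the regularized
hinge loss. The function names, the hyperparameters, the label convention (+1 for fake, -1 for real),
and the English placeholder documents are all assumptions made for illustration; real experiments
would use the manually labeled Amharic corpus.

```python
import re
from collections import Counter

# Assumed tokenizer: split on whitespace, Latin punctuation, and the
# Ethiopic punctuation marks U+1361..U+1368 (e.g. the full stop "።").
TOKEN_SPLIT = re.compile(r"[\s\u1361-\u1368.,!?]+")


def tokenize(text):
    return [tok for tok in TOKEN_SPLIT.split(text) if tok]


def bag_of_words(text):
    """Map a document to sparse token counts."""
    return Counter(tokenize(text))


def score(weights, bias, features):
    return sum(weights.get(f, 0.0) * v for f, v in features.items()) + bias


def train_linear_svm(samples, labels, epochs=50, lr=0.1, lam=0.01):
    """Sub-gradient descent on the L2-regularized hinge loss.

    labels are +1 (fake) or -1 (real); hyperparameters are illustrative.
    """
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for features, label in zip(samples, labels):
            margin = label * score(weights, bias, features)
            # Regularization step: shrink every weight slightly.
            for f in weights:
                weights[f] *= 1.0 - lr * lam
            # Hinge-loss step: only samples inside the margin move the weights.
            if margin < 1.0:
                for f, v in features.items():
                    weights[f] = weights.get(f, 0.0) + lr * label * v
                bias += lr * label
    return weights, bias


def predict(weights, bias, text):
    return 1 if score(weights, bias, bag_of_words(text)) >= 0 else -1


# Toy placeholder corpus (not real data).
train_texts = [
    "shocking secret cure revealed",
    "ministry announces budget report",
    "secret miracle cure shocking",
    "council publishes budget report",
]
train_labels = [1, -1, 1, -1]

weights, bias = train_linear_svm([bag_of_words(t) for t in train_texts], train_labels)
predictions = [predict(weights, bias, t) for t in train_texts]
```

The sign of the score decides the class, which is why a two-class formulation maps directly onto the
fake/real distinction.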
Research questions
⮚ What are relevant textual linguistic features of an Amharic news article?
⮚ How can relevant features of Amharic news articles be used to classify news articles as fake or
real?
⮚ Which features will lead to high accuracy in a classification task?
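Answering the last question requires a way to measure classification quality. A minimal sketch of the
usual binary-classification metrics follows. Treating "fake" as the positive class and the helper name
binary_metrics are assumptions, and the label lists are made-up placeholders rather than results.

```python
def binary_metrics(y_true, y_pred, positive="fake"):
    """Accuracy, precision, recall and F1 for a two-class task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)

    # Guard every ratio against an empty denominator.
    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Placeholder labels for illustration only.
y_true = ["fake", "fake", "real", "real", "fake"]
y_pred = ["fake", "real", "real", "fake", "fake"]
metrics = binary_metrics(y_true, y_pred)
```

Accuracy alone can mislead when the two classes are unbalanced, so precision and recall on the fake
class are worth reporting alongside it.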
Chapter 2 Literature Review
Fake news is a major threat that jeopardizes individuals, the public, and
government credibility, and can ultimately endanger national security. Nevertheless, because no
Amharic fake news dataset is publicly available and Amharic is an under-resourced
language, little effort has been made to combat Amharic fake news on online social media using
natural language processing and machine learning. Fake news detection is a hot and new
research area that has emerged in recent years. The related works in this section are organized as follows.
Hussain et al. (2020) worked on the detection of Bangla fake news. In this research, an inverse
document frequency vectorizer and a count vectorizer were used for feature extraction, and
Support Vector Machine (SVM) and Multinomial Naive Bayes (MNB) classifiers were used to recognize
Bangla fake news. The experiment was conducted on a Bangla-language dataset collected
from various Bangladeshi online news sources, with a total of 2500 news articles
(1548 real and 940 fake). Results showed that SVM with a linear kernel achieved 96.64 percent
accuracy, outperforming MNB at 93.32 percent.
Smitha et al. (2020) focused on evaluating the performance of seven
machine learning classifiers, namely Linear SVM, Random Forest, Logistic Regression, XGBoost,
Gradient Boosting, Decision Tree, and a neural network, on the fake news detection task. These
classifiers were evaluated with three feature engineering methods, namely count
vectors, TF-IDF, and word embeddings, on a Kaggle dataset with two class labels (fake or real).
The highest accuracy, 0.94, was obtained both by the linear SVM with
TF-IDF feature extraction and by the neural network with a count vectorizer. Finally, the
authors argued that although the neural network with a count vectorizer matched the
accuracy of the SVM, the linear SVM
classifier is preferable for their proposed fake news detection system in terms of running time and
complexity.
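To make the difference between count vectors and TF-IDF concrete, here is a small sketch. It uses the
textbook weighting tf(t, d) * log(N / df(t)); this choice, the function names, and the placeholder
tokens are assumptions for illustration, and library implementations typically apply smoothed variants
that produce different numbers.

```python
import math
from collections import Counter


def count_vectors(docs):
    """Raw term counts per tokenized document."""
    return [Counter(tokens) for tokens in docs]


def tfidf_vectors(docs):
    """Relative term frequency times log(N / document frequency)."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in docs:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Placeholder tokenized documents (not real data).
docs = [["news", "shock", "shock"], ["news", "report"]]
counts = count_vectors(docs)
tfidf = tfidf_vectors(docs)
```

A term that occurs in every document, like "news" here, receives weight zero under this TF-IDF
variant while keeping its raw count in the count vector.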
Abdulaziz et al. (2020) conducted an empirical comparison of four well-known
machine learning algorithms, namely random forest, Naïve Bayes, a neural network,
and decision trees, using an N-gram model for feature extraction on the publicly available
multi-class LIAR dataset. The results showed that trigram features
with the Naïve Bayes classifier outperformed the other algorithms remarkably on this dataset,
achieving the highest accuracy of 99%. In addition, the training times
were 54.3 s, 1.3 s, 420.2 s, and 540.1 s for random forest, Naïve Bayes, the neural
network, and decision trees, respectively. The paper therefore argued that Naïve Bayes performs
best, with the highest accuracy as well as the lowest running time.
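The N-gram features mentioned above can be illustrated briefly. This sketch assumes word-level
n-grams over an already tokenized text; the function names and the placeholder tokens are hypothetical
and are not taken from the cited paper.

```python
from collections import Counter


def word_ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(tokens, n):
    """Count n-gram occurrences, usable as sparse classifier features."""
    return Counter(word_ngrams(tokens, n))


# Placeholder tokens for illustration only.
tokens = ["the", "minister", "denied", "the", "minister", "resigned"]
trigrams = word_ngrams(tokens, 3)
bigram_features = ngram_counts(tokens, 2)
```

Larger values of n capture more word-order context but make the feature space sparser, which is one
reason the choice of n is usually treated as an experimental parameter.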
Ahmed et al. (2017) conducted their research under the title “Detecting Online
Fake News Using N-Gram Analysis and Machine Learning Techniques”. The paper used N-gram
models and TF-IDF for feature generation and compared six machine learning
classifiers, including Stochastic Gradient Descent (SGD), Support Vector Machines (SVM), Linear
Support Vector Machines (LSVM), K-Nearest Neighbor (KNN), and Decision Trees (DT), on
the two-class ISOT dataset, which the authors collected from Kaggle and reuters.com.
Results showed that the best-performing model used Term Frequency-Inverse
Document Frequency (TF-IDF) as the feature extraction technique and a Linear Support Vector
Machine (LSVM) as the classifier, with an accuracy of 92%.