A Comparative Study On Fake Profile Identification Using Different Machine Learning Techniques
A Comparative Study On Fake Profile Identification Using Different Machine Learning Techniques
v
TABLE OF CONTENTS
vi
LIST OF FIGURES
vii
LIST OF ABBREVIATIONS
ABBREVIATIONS EXPANSION
ML MACHINE LEARNING
DT DECISION TREE
viii
CHAPTER 1
INTRODUCTION
It has become quite unpretentious to obtain any kind of information from any
source across the world by using the Internet. The increased demand of social
sites permits users to collect abundant amount of information and data about
users. Huge volumes of data available on these sites also draw the attention of
fake users. Twitter has rapidly become an online source for acquiring real-time
information about users. Twitter is an Online Social Network (OSN)where users
can share anything and everything, such as news, opinions, and even their
moods. Several arguments can be held over different topics, such as politics,
current affairs, and important events. When a user tweets something, it is
instantly conveyed to his/her followers, allowing them to outspread the received
information at a much broader level [2].
With the evolution of OSNs, the need to study and analyze users‟
behaviors in online social platforms has intensified . Many people who do not
have much information regarding the OSN scan easily be tricked by the
fraudsters. There is also a demand to combat and place a control on the people
who use OSNs only for advertisements and thus spam other people‟s accounts.
Recently, the detection of spam in social networking sites attracted the attention
of researchers. Spam detection is a difficult task in maintaining the security of
social networks. It is essential to recognize spams in the OSN sites to save
users from various kinds of malicious attacks and to preserve their security and
privacy. These hazardous maneuvers adopted by spammers cause massive
destruction of the community in the real world. Twitter spammers have various
objectives, such as spreading invalid information, fake news, rumors, and
spontaneous messages.
1
Spammers achieve their malicious objectives through advertisements and
several other means where they support different mailing lists and subsequently
dispatch spam messages randomly to broadcast their interests. These activities
cause disturbance to the original users who are known as non-spammers. In
addition, it also decreases the repute of the OSN platforms. Therefore, it is
essential to design a scheme to spot spammers so that corrective efforts can be
taken to counter their malicious activities. Several research works have been
carried out in the domain of Twitter spam detection. To encompass the existing
state-of the-art, a few surveys have also been carried out on fake user
identification from Twitter. Tingmin et al. Provide a survey of new methods
and techniques to identify Twitter spam detection.
The survey presents a comparative study of the current approaches. On
the other hand, the authors in conducted a survey on different behaviors
exhibited by spammers on Twitter social network. The study also provides a
literature review that recognizes the existence of spammers on Twitter social
network. Despite all the existing studies, there is still a gap in the existing
literature. Therefore, to bridge the gap, we review state-of-the-art in the
spammer detection and fake user identification on Twitter. Moreover, this
survey presents taxonomy of the Twitter spam detection approaches and
attempts to offer a detailed description of recent developments in the domain.
The aim of this project is to identify different approaches of spam
detection on Twitter and to present a taxonomy by classifying these approaches
into several categories. For classification, we have identified four means of
reporting spammers that can be helpful in identifying fake identities of users.
Spammers can be identified based on: (i) fake content, (ii) URL based spam
detection, (iii) detecting spam in trending topics, and (iv)fake user identification
.
2
CHAPTER 2
LITERATURE SURVEY
Twitter spam has become a critical problem nowadays. Recent works focus on
applying machine learning techniques for Twitter spam detection, which make
use of the statistical features of tweets. In this labeled tweets data set, however,
we observe that the statistical properties of spam tweets vary over time, and
thus, the performance of existing machine learning-based classifiers decreases.
This issue is referred to as “Twitter Spam Drift”. In order to tackle this problem,
we first carry out a deep analysis on the statistical features of one million spam
tweets and one million non-spam tweets, and then propose a novel Lfun
scheme. The proposed scheme can discover “changed” spam tweets from
unlabeled tweets and incorporate them into classifier's training process. A
number of experiments are performed to evaluate the proposed scheme. The
results show that our proposed Lfun scheme can significantly improve the spam
detection accuracy in real-world scenarios.
The popularity of Twitter attracts more and more spammers. Spammers send
unwanted tweets to Twitter users to promote websites or services, which are
harmful to normal users. In order to stop spammers, researchers have proposed
a number of mechanisms. The focus of recent works is on the application of
machine learning techniques into Twitter spam detection. However, tweets are
retrieved in a streaming way, and Twitter provides the Streaming API for
4
developers and researchers to access public tweets in real time. There lacks a
performance evaluation of existing machine learning-based streaming spam
detection methods. In this paper, we bridged the gap by carrying out a
performance evaluation, which was from three different aspects of data, feature,
and model.
A big ground-truth of over 600 million public tweets was created by
using a commercial URL-based security tool. For real-time spam detection, we
further extracted 12 lightweight features for tweet representation. Spam
detection was then transformed to a binary classification problem in the feature
space and can be solved by conventional machine learning algorithms. We
evaluated the impact of different factors to the spam detection performance,
which included spam to nonspam ratio, feature discretization, training data size,
data sampling, time-related data, and machine learning algorithms. The results
show the streaming spam tweet detection is still a big challenge and a robust
detection technique should take into account the three aspects of data, feature,
and model.
In this paper, we view the task of identifying spammers in social networks from
a mixture modeling perspective, based on which we devise a principled
unsupervised approach to detect spammers. In our approach, we first represent
each user of the social network with a feature vector that reflects its behaviour
and interactions with other participants. Next, based on the estimated users
feature vectors, we propose a statistical framework that uses the Dirichlet
distribution in order to identify spammers. The proposed approach is able to
5
automatically discriminate between spammers and legitimate users, while
existing unsupervised approaches require human intervention in order to set
informal threshold parameters to detect spammers. Furthermore, our approach is
general in the sense that it can be applied to different online social sites. To
demonstrate the suitability of the proposed method, we conducted experiments
on real data extracted from Instagram and Twitter.
Law Enforcement Agencies cover a crucial role in the analysis of open data and
need effective techniques to filter troublesome information. In a real scenario,
Law Enforcement Agencies analyze Social Networks, i.e. Twitter, monitoring
events and profiling accounts. Unfortunately, between the huge amount of
internet users, there are people that use microblogs for harassing other people or
spreading malicious contents. Users' classification and spammers' identification
is a useful technique for relieve Twitter traffic from uninformative content.
This work proposes a framework that exploits a non-uniform feature
sampling inside a gray box Machine Learning System, using a variant of the
Random Forests Algorithm to identify spammers inside Twitter traffic.
Experiments are made on a popular Twitter dataset and on a new dataset of
Twitter users. The new provided Twitter dataset is made up of users labeled as
spammers or legitimate users, described by 54 features. Experimental results
demonstrate the effectiveness of enriched feature sampling method
6
CHAPTER 3
METHODOLOGY
• Because of Privacy Issues the Facebook dataset is very limited and a lot
of details are not made public.
• More complex