
Volume 10, Issue 1, January 2025 | International Journal of Innovative Science and Research Technology (IJISRT)
ISSN No: 2456-2165 | IJISRT25JAN816 | https://doi.org/10.5281/zenodo.14730663

Improving Accuracy of Twitter Fake Profile Detection Using Deep Learning

Vaibhavi Kakade; Chavan Akanksha; Vaibhav Dhakne; Prajakta Nalwade
Savitribai Phule Pune University
Shivnagar Vidya Prasarak Mandal, Malegaon (Bk), Tal. Baramati, Dist. Pune

Publication Date: 2025/01/25

Abstract
This paper addresses the issue of fake accounts on popular social media platforms such as Twitter, which spread false information, malicious content, and spam. Online social networks have grown rapidly, with billions of users worldwide. This growth has led to many fake accounts, causing problems such as spam, fake news, and political manipulation. Fake accounts can also harm businesses financially and damage their reputation. Detecting these fraudulent accounts is therefore crucial. Recently, researchers have been using neural network algorithms to identify fake accounts more effectively. Our system uses several types of neural networks, including feedforward and recurrent neural networks, as well as deep learning models, to address this issue. Specifically, we combine artificial neural networks (ANN) with principal component analysis (PCA) to create a reliable system for spotting fake accounts on social media. By collecting and processing data thoroughly, extracting important features, and training the ANN, we show that our method outperforms traditional approaches at detecting fake accounts. Our results highlight the potential for greater accuracy and efficiency in protecting the integrity of online social networks.

I. INTRODUCTION

Over the last two decades, there has been tremendous growth of social networking, bringing millions of users from all walks of life to this new form of online interaction. However, such rapid growth raises concern about a new phenomenon: fake accounts that represent nobody at all. These bot accounts spread false information and manipulate web ratings by generating spam, amongst other abusive practices which are prohibited on sites such as Twitter. Examples of inappropriate use of social media include automated interaction, attempts at manipulating or misleading end-users, and worse behaviors such as posting malicious links, aggressive following and unfollowing of others, multiple account creation, duplicated updates, and abuse of the reply and mention functions. Authentic posts, by contrast, adhere to the site's guidelines. This phenomenon has far-reaching effects, since real-time tweets and messages allow information to swiftly reach a huge number of users. One of the biggest problems with social media is spammers, who use their accounts for a variety of nefarious purposes, such as spreading misinformation that can harm any company and affect society as a whole.

Therefore, this research focuses on identifying phony profile accounts on the Twitter online social network. The primary goal is to stop the proliferation of different types of bogus news, advertisements, and followers. Furthermore, the rapid growth of social media platforms like Facebook, LinkedIn, Twitter, and Instagram, together with enormous advancements in wireless communication technology, has made it possible for over half of the world's population to be active internet users engaged in social media activities. The proliferation of fake accounts has led to serious problems such as the spread of fake news, political manipulation, hate speech, and spam activities that threaten the legitimacy and dependability of online social networks, even though the user base has grown exponentially. Machine learning techniques are becoming essential for identifying phony accounts in order to lessen these difficulties. However, cyber-thieves keep designing bots that can evade such detection strategies, so there is an ongoing cat-and-mouse game between detection algorithms and malefactors. In this study, we follow a structured approach: a literature review, an explanation of the proposed detection technique, and a comparative analysis of results from various algorithms, thereby contributing to the development of techniques to identify and fight the spread of fake accounts.

II. LITERATURE SURVEY

• Detecting Twitter Fake Accounts using Machine Learning and Data Reduction Techniques: In order to detect phony social media accounts, previous research used a variety of datasets and machine learning techniques. For more precise detection, recent research favors deep learning models like neural networks. This underscores the transition to neural network-based methods and the importance of diverse datasets.

• Twitter Fake Account Detection: The rise in fake accounts on Twitter poses risks such as spreading fake news and spam. To differentiate between authentic and fraudulent users, feature-based detection techniques track user activity. Different attribute sets and detection techniques were investigated in a number of studies. In order to efficiently manage numerical attributes, some studies concentrate on discretization approaches.
• Detecting Fake Accounts on Social Media: Introduces SVM-NN, a novel algorithm for fake Twitter account detection. Techniques for dimension reduction and feature selection were used in preprocessing. SVM-NN achieves good classification accuracy, outperforming other classifiers.
• Detection of Fake Profile in Online Social Networks Using Machine Learning: Suggests using SVM-NN to detect phony Twitter accounts. Machine learning methods and feature selection strategies are used, with SVM-NN demonstrating the best results. It has been observed that correlation-based feature selection methods are more successful than PCA.
• Using Machine Learning to Detect Fake Identities: Humans versus Bots: The author trained and assessed machine learning models using a corpus of social media accounts. In order to improve the detection of fraudulent accounts, engineered features derived from social network attributes were added to the corpus. Metrics such as accuracy, F1-score, and precision-recall area under curve (PR-AUC) were used to assess each model's efficacy.

III. RESEARCH METHODOLOGY

This section describes the process used to identify phony Twitter profiles using a deep learning model. Data preprocessing, feature engineering, and model training and evaluation are the three main phases of our methodology.

Preprocessing Data: The MIB dataset, which contains information on user profiles and activities, is first preprocessed. Natural Language Processing (NLP) methods such as language recognition, tokenization, stop word removal, and TF-IDF conversion are applied to textual features, especially user descriptions. Through this procedure, textual data is converted into a numerical representation that deep learning models can use. Feature scaling is also used to make sure all features, numerical and post-NLP, contribute equally during training.

Feature Engineering: In addition to the MIB dataset's raw characteristics, we investigate the development of new features that could improve the model's capacity to differentiate between authentic and fraudulent accounts. This could entail finding patterns suggestive of questionable activity (e.g., a high favorite count with a low status count) or computing ratios between existing features (e.g., followers count / friends count), as sketched below.
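As a rough illustration of the ratio-based feature engineering described above, the following sketch derives a few candidate features with pandas. It assumes the MIB profile table has already been loaded into a DataFrame with columns such as followers_count, friends_count, favourites_count, and statuses_count; these column names are assumptions for illustration, not taken from the paper.

```python
import pandas as pd

def add_ratio_features(profiles: pd.DataFrame) -> pd.DataFrame:
    """Add simple engineered ratio features; column names are assumed, not from the paper."""
    df = profiles.copy()
    # Followers-to-friends ratio; +1 avoids division by zero for empty profiles.
    df["followers_friends_ratio"] = df["followers_count"] / (df["friends_count"] + 1)
    # A high favourite count paired with a low status count was flagged above as suspicious.
    df["favourites_per_status"] = df["favourites_count"] / (df["statuses_count"] + 1)
    return df

# Example usage with a tiny made-up profile table:
sample = pd.DataFrame({
    "followers_count": [10, 5000],
    "friends_count": [2000, 300],
    "favourites_count": [15000, 120],
    "statuses_count": [3, 4000],
})
print(add_ratio_features(sample))
```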
Model Training and Evaluation: A deep learning model, namely a feed-forward neural network, serves as the primary classification engine. We test different network topologies, including the number of neurons and hidden layers, to obtain optimal performance. The model is trained on a portion of the preprocessed and possibly feature-engineered MIB dataset. Evaluation metrics including accuracy, precision, recall, and F1-score are used to assess the model's ability to identify fraudulent profiles on a separate hold-out test set (a short sketch of this evaluation step follows below).

Software and Tools: During the data preprocessing and deep learning model implementation phases, libraries such as TensorFlow may be utilized. However, for the sake of this research, we focus on a deep learning approach that is incorporated into a deep learning framework.
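The evaluation protocol described here (train on a portion of the data, report accuracy, precision, recall, and F1-score on a hold-out split) can be expressed with scikit-learn as a minimal sketch. X and y stand in for the preprocessed feature matrix and labels, the 80/20 split is an assumed choice, and a logistic regression model is used purely as a placeholder for the feed-forward network described later.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(5301, 20))       # placeholder feature matrix (post-preprocessing)
y = rng.integers(0, 2, size=5301)     # placeholder labels: 0 = real, 1 = fake

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Stand-in classifier for the sketch; the paper's model is the feed-forward network below.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Precision, recall, F1-score, and accuracy on the hold-out test set.
print(classification_report(y_test, y_pred, target_names=["real", "fake"]))
```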
A. Dataset Description:
Our study makes use of the "MIB" dataset, which was made freely available by Cresci et al. (2015). In all, 5,301 Twitter accounts, divided into actual and fraudulent accounts, make up this dataset (a small loading sketch follows these lists).

• Actual Accounts:
  • The 469 accounts in the "Fake Project" dataset were gathered by human researchers at IIT-CNR in Pisa, Italy.
  • Two sociologists from the University of Perugia in Italy confirmed that the 1,481 authentic human accounts in the "E13 (elezioni 2013)" dataset are genuine.

• False Accounts:
  • The dataset known as "Fastfollowerz" comprises 1,337 accounts, and the dataset titled "Intertwitter" has 1,169 accounts.
  • In 2013, researchers bought 845 accounts from the market to create the "Twitter-technology" dataset.
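A minimal sketch of how the MIB subsets described above could be assembled into a single labeled table, assuming each subset has been exported to a CSV of profile features. The file names and the 0/1 label convention are assumptions for illustration, not part of the released dataset.

```python
import pandas as pd

# (file name, label) pairs; 0 = actual account, 1 = fake account (assumed convention).
SUBSETS = [
    ("fake_project_users.csv", 0),       # 469 accounts verified by IIT-CNR researchers
    ("e13_users.csv", 0),                # 1,481 accounts verified by sociologists
    ("fastfollowerz_users.csv", 1),      # 1,337 purchased fake accounts
    ("intertwitter_users.csv", 1),       # 1,169 purchased fake accounts
    ("twittertechnology_users.csv", 1),  # 845 purchased fake accounts
]

frames = []
for path, label in SUBSETS:
    df = pd.read_csv(path)
    df["label"] = label
    frames.append(df)

mib = pd.concat(frames, ignore_index=True)
print(mib["label"].value_counts())  # expected total: 5,301 accounts
```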
B. Data Preprocessing:
Data Cleaning and Missing Value Imputation: If a significant percentage of values in the dataset are missing, we use suitable methods such as mean/median imputation or deletion to resolve the missing values. This guarantees that the model has all the data it needs to be trained.

• User Description Text Preprocessing (a code sketch follows this list):
  • We apply Natural Language Processing (NLP) methods to user descriptions, which are textual elements that may contain useful information. Language Detection: finding the prevailing language of each description is essential for the subsequent steps. Tokenization: breaking text up into discrete words or characters to create a sequence that can be processed further.
  • Stop Word Removal: To lower dimensionality and concentrate on pertinent content, frequent words with little meaning (such as "the" and "a") are eliminated.
  • Alternatives to Label Encoding for Textual Feature Representation. One-Hot Encoding (if there are not too many distinct descriptions): with this method, a new binary feature is produced for every distinct category. The associated feature value is set to 1 if a text sample falls into that category and to 0 otherwise. This lets the model discover connections between features while maintaining the categorical nature of the input.
  • Word Embeddings: these are useful when working with big datasets or when attempting to capture semantic links. Words are represented as vectors in a high-dimensional space, and semantic relationships are captured by placing words with comparable meanings closer together. GloVe and Word2Vec are two well-liked word embedding techniques. To effectively represent textual data, these pre-trained embeddings can be fed into the deep learning model.
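The text pipeline above (language detection, tokenization, stop word removal, and TF-IDF conversion) can be approximated with standard libraries. The sketch below assumes the langdetect package and scikit-learn are available and that user descriptions sit in a list of strings; it illustrates the general approach, not the authors' exact implementation.

```python
from langdetect import detect  # pip install langdetect
from sklearn.feature_extraction.text import TfidfVectorizer

descriptions = [
    "Tech enthusiast. Opinions are my own.",
    "Follow back guaranteed!!! #follow4follow",
    "",  # empty descriptions are common in profile data
]

def safe_detect(text: str) -> str:
    """Return an ISO language code, or 'unknown' when detection fails (e.g. empty text)."""
    try:
        return detect(text)
    except Exception:
        return "unknown"

languages = [safe_detect(d) for d in descriptions]

# Tokenization, English stop word removal, and TF-IDF weighting in one step.
vectorizer = TfidfVectorizer(stop_words="english", lowercase=True)
tfidf_matrix = vectorizer.fit_transform(descriptions)

print(languages)           # detected language per description
print(tfidf_matrix.shape)  # (n_descriptions, vocabulary_size)
```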
Architecture Model: We use a feed-forward neural network architecture with five consecutive layers in our deep learning model for Twitter false profile identification. This architecture seeks to capture complex interactions between the preprocessed characteristics (numerical and post-NLP textual elements) in order to successfully classify actual and fraudulent profiles.

Figure 1: Architecture of Neural Networks

The first layer is dense and made up of 64 neurons. Since dense layers are fully connected, every neuron is able to receive input from every feature. This layer combines its inputs in a weighted linear fashion and applies a ReLU (Rectified Linear Unit) activation function. By adding nonlinearity, ReLU enables the model to recognize increasingly intricate patterns; it promotes effective learning by letting only positive values through. A dropout layer is added after the initial dense layer. During training, dropout randomly deactivates a predetermined percentage of neurons (for example, 20%). By decreasing neuronal codependency and encouraging the model to learn more resilient properties, this lessens overfitting. The third and fourth layers are similarly dense, with 32 neurons each. These layers extract higher-level representations from the data and gradually refine the learned features. Dropout layers are positioned after the second and fourth dense layers to increase feature robustness and avoid overfitting. The last layer consists of one neuron with a sigmoid activation function. Every neuron in the preceding dense layer sends information to this cell. The sigmoid function returns a number between 0 and 1, and the model predicts that a profile is either genuine (closer to 0) or fraudulent (closer to 1). The combination of dropout and ReLU layers in this architecture lets the model learn increasingly complicated feature representations from the data, and the likelihood score produced by the final output layer allows each profile to be classified as authentic or fake based on a predetermined threshold.
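The layer description above maps naturally onto a Keras Sequential model. The following is a minimal sketch consistent with that description (64-unit first dense layer, 32-unit intermediate layers, roughly 20% dropout, single sigmoid output); the exact layer count and dropout placement in the published model may differ, and the input dimension here is a placeholder.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

N_FEATURES = 32  # placeholder: number of numerical, PCA-reduced, and text-derived features

model = models.Sequential([
    layers.Input(shape=(N_FEATURES,)),
    layers.Dense(64, activation="relu"),    # first dense layer, 64 neurons, ReLU
    layers.Dropout(0.2),                    # dropout after the first dense layer
    layers.Dense(32, activation="relu"),
    layers.Dropout(0.2),
    layers.Dense(32, activation="relu"),
    layers.Dropout(0.2),
    layers.Dense(1, activation="sigmoid"),  # single-neuron output: fake-profile probability
])

model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=["accuracy", tf.keras.metrics.Precision(), tf.keras.metrics.Recall()],
)
model.summary()
```

Training this model with model.fit on the preprocessed features and evaluating it on the hold-out split then yields the accuracy, precision, recall, and F1-score figures discussed in the evaluation phase.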
C. Data Reduction:
Following data preprocessing, we use Principal Component Analysis (PCA) to reduce the dimensionality of the MIB dataset's numerical characteristics. High dimensionality can cause the "curse of dimensionality," which makes data sparse and learning challenging, and can also lengthen training times. Principal components (PCs), a smaller group of features identified by PCA, capture the most important variation in the data. This enables us to keep important information while reducing the number of features. Our goal is to strike a compromise between computing efficiency and information retention by choosing a subset of informative PCs. Our deep learning model for fake profile identification is then trained on this reduced-dimensionality dataset, which could result in quicker training times and better model performance.
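As a sketch of this step (using scikit-learn rather than any implementation detail given in the paper), the numerical features can be standardized and projected onto enough principal components to retain most of the variance, for example 95%; the data below is random and purely illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_numeric = rng.normal(size=(5301, 12))  # stand-in for the MIB numerical features

# Standardize so every feature contributes equally before PCA.
X_scaled = StandardScaler().fit_transform(X_numeric)

# Keep as many principal components as needed to explain 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                      # (5301, k) with k <= 12
print(pca.explained_variance_ratio_.sum())  # fraction of variance retained
```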
D. Algorithms:

• NLP:
Natural language processing (NLP), a subfield of artificial intelligence (AI), deals with the use of natural language in computer-human interaction. It includes a broad range of tasks designed to make it possible for machines to comprehend, interpret, and produce data in human language (a small example follows this list):

  • Tokenization: breaking text down into smaller parts, such as words, phrases, or sentences. For many NLP projects, this is the initial step.
  • Part-of-Speech (POS) Tagging: giving each word in a phrase a grammatical tag (such as noun, verb, or adjective).
  • Named Entity Recognition (NER): identifying and categorizing entities mentioned in text, such as names of people, organizations, locations, and dates.
  • Parsing: examining a sentence's grammatical structure to determine its syntactic linkages.
  • Sentiment Analysis: identifying the sentiment (neutral, negative, or positive) expressed in a document.
  • Machine Translation: automatically translating text between languages.
  • Text Generation: producing human-like writing in response to prompts or contexts.
  • Question Answering: automatically determining responses to queries posed in natural language.
  • Text Summarization: distilling lengthy texts into concise synopses while keeping the most crucial details.
  • Topic Modeling: determining the primary subjects covered in a group of documents.
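A brief illustration of the first few tasks (tokenization, stop word removal, POS tagging, and NER) using spaCy; this is generic example code, not the preprocessing pipeline used in the study, and it assumes the small English model has been installed.

```python
import spacy

# Assumes the small English model is available:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Sara joined Twitter in 2013 and posts technology news from Pune, India.")

tokens = [t.text for t in doc]                                   # tokenization
content = [t.text for t in doc if t.is_alpha and not t.is_stop]  # stop word removal
pos_tags = [(t.text, t.pos_) for t in doc]                       # part-of-speech tagging
entities = [(e.text, e.label_) for e in doc.ents]                # named entity recognition

print(content)
print(pos_tags)
print(entities)  # e.g. dates, places, and person names mentioned in the text
```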
• PCA:
Principal Component Analysis transforms the original features into a smaller set of uncorrelated components, as described in the Data Reduction step above. The main steps are listed below, followed by a short worked sketch:

  • Standardization: set the variables to have a standard deviation of one and a mean of zero.
  • Covariance Matrix: determine the relationship between each variable and all other variables.
  • Eigendecomposition: determine the covariance matrix's eigenvalues (the magnitude of variance) and eigenvectors (the directions of maximum variance).
  • Principal Component Selection: to preserve the greatest amount of variance, select the top eigenvectors according to their matching eigenvalues.
  • Projection: to create a new feature space, project the original data onto the chosen principal components.
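A compact NumPy sketch of exactly those five steps, independent of any particular library implementation; the data here is random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 6))            # illustrative data: 200 samples, 6 features

# 1. Standardization: zero mean, unit standard deviation per feature.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized features.
cov = np.cov(X_std, rowvar=False)

# 3. Eigendecomposition: eigenvalues = variance magnitudes, eigenvectors = directions.
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh, since covariance matrices are symmetric

# 4. Principal component selection: keep the top-k eigenvectors by eigenvalue.
order = np.argsort(eigvals)[::-1]
k = 3
components = eigvecs[:, order[:k]]

# 5. Projection onto the chosen principal components.
X_projected = X_std @ components

print(X_projected.shape)                  # (200, 3)
print(eigvals[order[:k]] / eigvals.sum()) # variance explained by each retained PC
```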
• Feed Forward Neural Network:
A feedforward neural network, in which node connections do not form cycles, is the most basic type of artificial neural network. This structure consists of an input layer, an output layer, and one or more hidden layers. Information moves in just one direction, from input to output. Each layer's nodes use weighted connections to process information from the previous layer before sending it to the next layer, where activation functions are typically applied. A minimal forward-pass sketch follows the list below.

  • Input Layer: receives the initial features or data. Each node represents one feature (such as a pixel value in image recognition). It serves as the neural network's data entry point.
  • Hidden Layers: intermediary layers between input and output. Their nodes use activation functions and weighted connections to carry out intricate transformations, allowing the network to learn abstract features from the input data.
  • Weighted Connections: connections between nodes in neighboring layers. These establish the strength of each connection and are adjusted based on the error during training. They represent the significance of input features in prediction-making.
  • Activation Functions: introduce non-linearity. Applied to the weighted aggregate of the inputs, they allow the network to identify complex patterns and facilitate the discovery of complex relationships within the data.
  • Output Layer: uses the processed input data to generate the network's predictions. The task determines the number of nodes (e.g., many for multi-class classification, one for binary classification). It provides the neural network's final output.
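To make the one-directional flow concrete, here is a tiny NumPy sketch of a single forward pass through one hidden layer and a sigmoid output; the weights are random placeholders, since learning them via training is handled by the deep learning framework.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)          # lets only positive values through

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))    # squashes the output into (0, 1)

rng = np.random.default_rng(1)
x = rng.normal(size=(4,))              # input layer: 4 features for one profile

W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)  # weighted connections to 8 hidden nodes
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)  # hidden layer to the single output node

hidden = relu(x @ W1 + b1)             # hidden layer activation
output = sigmoid(hidden @ W2 + b2)     # score in (0, 1) for the binary task

print(float(output))  # closer to 1 means predicted fraudulent, per the paper's convention
```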
Fig 1: Feedforward Neural Network

IV. ARCHITECTURE OF SYSTEM

Fig 2: Architecture of System

The JODIE model is optimized for early detection of malicious Twitter users by dynamically analyzing temporal user interactions. Key operations include updating and projecting user embeddings and creating a trajectory to model interactions over time. The embedding foreseeing model (EFM) uses JODIE to predict future interactions, while the embedding classification model (ECM) classifies users as fake or legitimate. This layered approach to data processing in social networks helps proactively identify malicious behavior patterns before they escalate.

V. SEQUENCE FLOW

The procedure for identifying fake profiles on Twitter is illustrated by the sequence diagram. The first step involves the user registering a Twitter account, which is then verified. After verification, the data is passed to preprocessing to be cleaned and normalized. The detection result is then obtained by extracting features from verified and non-confirmed accounts and feeding them into a model that determines whether the profile is authentic.

VI. CONCLUSION

In conclusion, our deep learning-based experiment on Twitter fake profile identification offers promising results and insights for improving online security. Through the creation and assessment of our deep learning model, we were able to determine how well it could accurately identify fake profiles. By examining user activity, account information, and tweet content, a deep learning model offers a practical way to identify fake Twitter profiles. Its high accuracy, scalability, and automation make it a useful tool for spotting fraudulent accounts.
KEY FINDINGS

• The deep learning model achieved promising results in identifying fake profiles on Twitter, demonstrating the potential of this approach for combating misinformation and improving platform integrity.
• The application of PCA for dimensionality reduction yielded valuable insights. In our case, PCA achieved comparable performance without compromising accuracy, suggesting its effectiveness in this setting.

FUTURE DIRECTIONS

This study creates opportunities for further investigation. The effectiveness of various deep learning architectures and optimization strategies for detecting phony profiles can be examined. Furthermore, experimenting with different text preparation techniques or adding additional user-related elements (such as network activity) might improve model performance. Overall, this study shows how deep learning can be used to identify fake Twitter profiles. By consistently improving methods and investigating novel approaches, we can help create a more dependable and trustworthy online environment. Finally, additional study is required to ascertain whether the approach can be applied to other social media platforms.

REFERENCES

[1]. Mohammad Abu Snober, "Detecting Twitter Fake Accounts using Machine Learning and Data Reduction Techniques," ResearchGate, 2021.
[2]. Buket Erşahin, Özlem Aktaş, Deniz Kılınç, and Ceyhun Akyol, "Twitter Fake Account Detection," 2017.
[3]. Ruben Sanchez-Corcuera and Arkaitz Zubiaga, "Early Detection and Prevention of Malicious User Behaviour on Twitter Using Deep Learning Techniques," 2024.
[4]. Sarangam Kodati, Kumbala Pradeep Reddy, Sreenivas Mekala, PL Srinivasa Murthy, and P Chandra Sekhar Reddy, "Detection of Fake Profiles on Twitter Using Hybrid SVM Algorithm," 2021.
[5]. Louzar Oumaima, Ramdi Mariam, Baida Ouafae, and Lyhyaoui Abdelouahid, "Fake Account Detection in Twitter using Long Short Term Memory and Convolutional Neural Network," 2024.
[6]. Faisal S. Alsubaei, "Detection of Inappropriate Tweets Linked to Fake Accounts on Twitter," 2023.
[7]. K. Harish, R. Naveen Kumar, and Dr. J. Briso Becky Bell, "Fake Profile Detection Using Machine Learning," 2023.
[8]. Giuseppe Sansonetti, Fabio Gasparetti, Giuseppe D'Aniello, and Alessandro Micarelli, "Unreliable Users Detection in Social Media: Deep Learning Techniques for Automatic Detection," 2020.
