Social Media Fake Account Detection Report 20pages

The document outlines a project aimed at developing a machine learning-based system for detecting fake accounts on social media platforms, addressing the growing concern of misinformation and fraud. It details the system's objectives, design, technologies used, evaluation metrics, challenges, and future enhancements, emphasizing the importance of data privacy and interdisciplinary collaboration. Key features include modularity, real-time detection, and the use of advanced NLP techniques to improve detection accuracy.

Uploaded by rohitadwani365


System Design Report

Project Title: Social Media Fake Account Detection

1. Introduction

Fake accounts are a growing concern not only because they skew
analytics and user engagement statistics, but also because they can be
weaponized in coordinated campaigns to mislead the public or execute
financial fraud. The significance of this issue is underscored by numerous
case studies in which bot networks have influenced public opinion or
engagement metrics on major platforms like Twitter and Instagram.

In recent years, fake accounts have become markedly more sophisticated.
With AI-driven content generation, they can now simulate highly realistic
behavior, which makes detection increasingly complex. Social media
companies are investing heavily in countermeasures, but an adaptable,
data-driven approach remains crucial.

These challenges demand not only technological solutions but also
interdisciplinary collaboration between data scientists, cybersecurity
experts, and sociologists.

Social media platforms have revolutionized the way people communicate and share
information. However, with the increasing popularity of these platforms,
there is also a surge in the number of fake accounts created for malicious purposes such as
spreading misinformation, phishing, spamming, and impersonation.
This project aims to develop a system for detecting such fake accounts using machine
learning techniques. The system leverages user profile features and
behavioral patterns to classify accounts as real or fake.

2. Objective

The primary objective of this project is to build a reliable fake account detection system that
can identify suspicious users on social media platforms.
The work covers data preprocessing, feature selection, model building, evaluation, and
result visualization to support decision-making.

Additional goals include improving detection precision by incorporating
NLP features and temporal activity tracking. The system also aims to be
adaptable across different social media platforms with varying data
availability.

Key performance goals include maximizing precision without sacrificing
recall, ensuring low latency for real-time detection, and maintaining user
data privacy through anonymized feature extraction. The system is
expected to provide actionable insights to administrators through
dashboards and alert systems.
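
The privacy goal in this section (anonymized feature extraction) could be approached roughly as sketched below. This is only an illustration under stated assumptions: the function names `pseudonymize` and `extract_features`, the `Bio` field, the derived ratio feature, and the secret key are all hypothetical and do not come from the report. Strictly speaking, a keyed hash gives pseudonymization rather than full anonymization, so a production system would likely need further safeguards.

```python
import hashlib
import hmac


def pseudonymize(user_id, secret_key):
    """Replace a raw user ID with a keyed SHA-256 digest (HMAC)."""
    return hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


def extract_features(profile, secret_key):
    """Keep only numeric signals; drop the username and the raw bio text."""
    followers = profile["Followers"]
    following = profile["Following"]
    return {
        "user_key": pseudonymize(profile["User_ID"], secret_key),
        # +1 avoids division by zero for accounts that follow nobody.
        "follower_following_ratio": followers / (following + 1),
        "bio_length": len(profile.get("Bio", "")),
        "has_profile_pic": int(profile["Profile_Pic_Status"]),
    }


# Hypothetical key and profile, for illustration only.
key = b"example-secret-key"
profile = {
    "User_ID": "12345",
    "Username": "alice",
    "Followers": 120,
    "Following": 79,
    "Bio": "Photographer",
    "Profile_Pic_Status": 1,
}
features = extract_features(profile, key)
```

The model then sees `user_key` and the derived numbers, while the username and bio text never leave the extraction step.
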

3. System Design

The design ensures modularity, where components such as preprocessing,
prediction, and visualization can be independently scaled or upgraded.
This modularity also allows for easy integration with new data sources or
changes in social media API policies.

Security and reliability are built into the system through access control
mechanisms, encryption protocols for data in transit and at rest, and
regular model audits. The pipeline is also designed to be fault-tolerant
with retry mechanisms and logging for monitoring.
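
A retry mechanism with logging, as mentioned above, might look like the following minimal sketch. The helper name `with_retries`, the backoff parameters, and the simulated `flaky_fetch` step are assumptions made for illustration; the report does not describe how retries are actually implemented.

```python
import logging
import time

logger = logging.getLogger("pipeline")


def with_retries(func, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call func(); on failure, log a warning and retry with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as exc:
            logger.warning("attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt == attempts:
                raise  # give up after the final attempt
            sleep(base_delay * 2 ** (attempt - 1))


# Simulated pipeline step that fails twice before succeeding.
calls = {"n": 0}


def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("storage temporarily unavailable")
    return "dataset.csv"


# The sleep function is replaced here so the example runs instantly.
result = with_retries(flaky_fetch, sleep=lambda seconds: None)
```

Passing `sleep` as a parameter keeps the helper easy to test; in a real deployment the default `time.sleep` would apply the backoff delays.
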

Each system component is decoupled, allowing independent deployment,
which ensures that failures in one service do not propagate to others.
Continuous integration and deployment (CI/CD) pipelines help automate
updates and testing.

3.1 Use Case Diagram

Below is the use case diagram that shows the interaction between the admin and the
system:

The admin interacts with the system by uploading the user data, initiating fake account
detection, viewing analytics, and exporting results.
The system performs preprocessing, prediction, and visualization based on the uploaded
dataset.

3.2 Database Design and Data Storage

Data is stored in structured CSV files with fields such as User_ID, Username, Followers, and
Following. These features are used for model training and prediction. The files are hosted
on cloud platforms (e.g., AWS S3, Firebase Storage) for scalability and accessibility.

Example CSV Schema:

- User_ID
- Username
- Followers
- Following
- Posts
- Bio_Length
- Profile_Pic_Status (0/1)
- Verified (True/False)
- Creation_Date
- Engagement_Score
- Label (Real/Fake)
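
As a rough illustration of how rows following the schema above could be parsed and validated, here is a minimal sketch using only the Python standard library. The function names `parse_row` and `load_accounts`, the ISO `YYYY-MM-DD` date format, and the two sample rows are assumptions; the report does not specify value formats beyond what the schema shows.

```python
import csv
import io
from datetime import date

# Column names taken from the example schema above.
EXPECTED_COLUMNS = [
    "User_ID", "Username", "Followers", "Following", "Posts", "Bio_Length",
    "Profile_Pic_Status", "Verified", "Creation_Date", "Engagement_Score", "Label",
]


def parse_row(row):
    """Convert one CSV row (a dict of strings) into typed Python values."""
    if row["Profile_Pic_Status"] not in ("0", "1"):
        raise ValueError(f"bad Profile_Pic_Status: {row['Profile_Pic_Status']!r}")
    if row["Verified"] not in ("True", "False"):
        raise ValueError(f"bad Verified: {row['Verified']!r}")
    if row["Label"] not in ("Real", "Fake"):
        raise ValueError(f"bad Label: {row['Label']!r}")
    return {
        "User_ID": row["User_ID"],
        "Username": row["Username"],
        "Followers": int(row["Followers"]),
        "Following": int(row["Following"]),
        "Posts": int(row["Posts"]),
        "Bio_Length": int(row["Bio_Length"]),
        "Profile_Pic_Status": int(row["Profile_Pic_Status"]),
        "Verified": row["Verified"] == "True",
        # Assumed ISO format; the report does not state the date format.
        "Creation_Date": date.fromisoformat(row["Creation_Date"]),
        "Engagement_Score": float(row["Engagement_Score"]),
        "Label": row["Label"],
    }


def load_accounts(csv_text):
    """Parse CSV text and fail early if any schema column is missing."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = set(EXPECTED_COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return [parse_row(row) for row in reader]


# Two invented rows, for illustration only.
SAMPLE = (
    ",".join(EXPECTED_COLUMNS) + "\n"
    "u1,alice,120,80,45,60,1,True,2019-03-14,0.42,Real\n"
    "u2,xx_promo_77,3,2100,0,0,0,False,2024-11-02,0.01,Fake\n"
)
accounts = load_accounts(SAMPLE)
```
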

3.3 Sequence Diagram / Activity Diagram

Sequence Flow:
1. Admin uploads user data via UI.
2. System performs preprocessing using Pandas.
3. Machine Learning model (Scikit-learn or ANN) is loaded.
4. Predictions are made.
5. Matplotlib is used to generate visual analytics.
6. Admin downloads or exports the reports.

Activity Diagram Steps:

- Start
- Upload Data
- Clean and Transform Data
- Train/Load Model
- Predict Labels
- Visualize Output
- Export/Save Results
- End
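
The flow above (clean and transform, train or load a model, predict, export) can be sketched with Pandas and Scikit-learn as follows. This is an assumption-laden illustration rather than the project's implementation: the choice of `RandomForestClassifier`, the feature subset, the `preprocess` helper, and the synthetic rows are all invented here, and the Matplotlib visualization step is left out for brevity.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Numeric columns from the example schema, used here as model inputs.
FEATURES = ["Followers", "Following", "Posts", "Bio_Length", "Profile_Pic_Status"]


def preprocess(df):
    """Clean and transform: drop duplicate users, fill gaps, encode the label."""
    df = df.drop_duplicates(subset="User_ID").copy()
    df[FEATURES] = df[FEATURES].fillna(0)
    df["is_fake"] = (df["Label"] == "Fake").astype(int)
    return df


# Synthetic rows invented for illustration only; not the project's dataset.
raw = pd.DataFrame({
    "User_ID": [f"u{i}" for i in range(12)],
    "Followers": [250, 310, 120, 980, 430, 75, 3, 0, 12, 5, 1, 8],
    "Following": [180, 200, 90, 400, 350, 60, 2100, 1800, 950, 3000, 1200, 2500],
    "Posts": [40, 85, 12, 300, 64, 9, 0, 1, 0, 2, 0, 1],
    "Bio_Length": [60, 45, 30, 120, 80, None, 0, 0, 5, 0, 0, 3],
    "Profile_Pic_Status": [1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0],
    "Label": ["Real"] * 6 + ["Fake"] * 6,
})

data = preprocess(raw)

# Hold out 4 accounts, keeping the real/fake proportion the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    data[FEATURES], data["is_fake"],
    test_size=4, stratify=data["is_fake"], random_state=0,
)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Predict labels and export the result as CSV text.
results = X_test.copy()
results["predicted_fake"] = model.predict(X_test)
report_csv = results.to_csv(index=False)
```
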

3.4 Deployment Diagram

The system is deployed with the following architecture:

- Client Node: User interface for admin
- Processing Node: Python backend that runs ML models
- Cloud Node: Storage for input/output files
- Visualization Node: Generates graphical outputs

Components:
- Frontend: Flask/Django Web UI or Jupyter Notebook
- Backend: Python ML scripts (Pandas, Scikit-learn, ANN)
- Storage: Cloud storage (CSV format)

4. Technologies Used

Programming Language: Python

Libraries and Frameworks:

- Pandas: Data preprocessing
- Scikit-learn: Machine learning algorithms
- Matplotlib: Visualization
- NLP: Text analysis on bios and posts
- ANN: Deep learning for feature-based classification

Data Storage: CSV (cloud hosted)

Hardware: Cloud infrastructure (scalable and distributed)

To support scalable deployment, containerization technologies such as
Docker and orchestration tools like Kubernetes can be employed. For NLP,
libraries such as spaCy and Transformers from Hugging Face are
particularly valuable for entity recognition and sentiment analysis.

In addition to the core stack, integration with visualization libraries such
as Plotly and dashboards built with Dash enables real-time analytics.
Elasticsearch can be added for log aggregation and anomaly detection in
behavior trends.

Serverless computing options like AWS Lambda or Google Cloud Functions
can further reduce infrastructure costs and enable dynamic scaling based
on usage spikes.
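
For the text analysis on bios listed under Libraries and Frameworks, a first pass might use simple hand-crafted signals before any NLP library is introduced. The sketch below uses only the standard library; the feature names, the URL pattern, and the sample bio are hypothetical, and a fuller system would presumably rely on dedicated NLP tooling instead.

```python
import re

# Rough URL matcher; good enough for counting links in short bios.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)


def bio_text_features(bio):
    """Return simple numeric features describing a profile bio."""
    n_chars = len(bio)
    return {
        "bio_length": n_chars,
        "word_count": len(bio.split()),
        "url_count": len(URL_PATTERN.findall(bio)),
        "digit_ratio": sum(ch.isdigit() for ch in bio) / n_chars if n_chars else 0.0,
        "uppercase_ratio": sum(ch.isupper() for ch in bio) / n_chars if n_chars else 0.0,
    }


# Invented bio, for illustration only.
features = bio_text_features("WIN FREE followers!!! visit http://example.com now 100% legit")
empty = bio_text_features("")
```

Numeric features like these can sit in the same CSV as the profile fields, so the raw bio text does not need to be stored alongside them.
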

5. Evaluation Metrics and Results

Model performance is evaluated using the following metrics:

- Accuracy
- Precision
- Recall
- F1-Score
- Confusion Matrix

Preliminary Results:
- Accuracy: 92%
- Precision: 90%
- Recall: 93%
- F1 Score: 91%

It is also important to evaluate the robustness of the model across
datasets from different platforms. Cross-validation and testing on unseen
social media datasets help verify that the model generalizes and
maintains its performance.

To support fairness and mitigate algorithmic bias, the evaluation includes
subgroup analysis by user demographic, activity type, and content
domain. Additionally, A/B testing can be conducted to measure
user-facing impact when deploying new model versions.
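
The metrics in this section all derive from the four cells of the confusion matrix. The sketch below shows the arithmetic, treating "fake" as the positive class. The counts passed to `classification_metrics` are hypothetical and are not the project's data; the second calculation simply checks that the reported precision (90%) and recall (93%) imply an F1 score of about 91.5%, in line with the reported 91%.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=90, fp=10, fn=5, tn=95)

# Consistency check on the reported precision and recall.
implied_f1 = f1_from(0.90, 0.93)
```
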

6. Challenges and Limitations

- Imbalanced dataset with more real accounts than fake.
- Feature extraction from unstructured bio data is challenging.
- Constant evolution of fake account behavior.
- Scalability of detection for real-time platforms.

Another limitation is the difficulty in maintaining up-to-date labeled
datasets, as manual labeling is time-consuming. Additionally, the
detection system might be less effective on newly created fake accounts
that have minimal activity.

An emerging concern is adversarial machine learning, where attackers
attempt to manipulate models by introducing noisy data or mimicking
real users. Techniques such as adversarial training and differential privacy
can be explored as countermeasures.

Ethical considerations also include the potential misclassification of
accounts and the impact on user trust. Therefore, transparency and user
appeal mechanisms should be included in production systems.
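
One common way to soften the class-imbalance limitation noted in this section is to weight each class inversely to its frequency during training. The sketch below computes such weights with the formula n_samples / (n_classes * class_count), which is the same heuristic Scikit-learn applies for `class_weight="balanced"`. The 90/10 split is invented for illustration, and the report does not say which imbalance strategy the project actually uses.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Invented 90/10 split between real and fake accounts.
labels = ["Real"] * 90 + ["Fake"] * 10
weights = balanced_class_weights(labels)
```

With these weights, misclassifying one of the rare fake accounts costs the model nine times as much as misclassifying a real one.
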

7. Future Enhancements

- Integrate with social media APIs for real-time detection.
- Enhance NLP analysis using advanced transformers like BERT.
- Use deep learning models like LSTM for behavior tracking.
- Build a user-friendly dashboard for continuous monitoring.

Beyond BERT and LSTM, exploring federated learning approaches could
allow training across multiple platforms without compromising user
privacy. Blockchain technology may also be investigated for
authenticating account provenance and tracking changes over time.

Another promising direction is the use of graph neural networks (GNNs)
for social graph-based inference. Such models leverage relationships and
interactions between users rather than isolated features. Integration with
big data platforms like Apache Spark or Flink would also enable handling of
massive, streaming datasets.

Lastly, ongoing collaboration with academic institutions can provide
access to benchmark datasets and help integrate current research.