Project Report 15
PROJECT REPORT
PHASE I
Submitted by
ABINAYA D [21CS062]
AKSHAYA E [21CS067]
INIKA R K [21CS092]
NANDHINI S [21CS116]
BACHELOR OF ENGINEERING
in
COMPUTER SCIENCE AND ENGINEERING
(AUTONOMOUS)
DECEMBER 2024
MUTHAYAMMAL ENGINEERING COLLEGE
(AUTONOMOUS)
RASIPURAM
BONAFIDE CERTIFICATE
Certified that this Report “PHISH CATCHER AND WEB SPOOFING ATTACK
USING MACHINE LEARNING” is the bonafide work of “ABINAYA D [21CS062],
AKSHAYA E [21CS067], INIKA R K [21CS092], NANDHINI S
[21CS116]” who carried out the work under my supervision.
SIGNATURE
Dr. G. KAVITHA, M.S (By Research), Ph.D.,
PROFESSOR
HEAD OF THE DEPARTMENT
Department of Computer Science and Engineering,
Muthayammal Engineering College (Autonomous), Rasipuram-637 408.

SIGNATURE
Mrs. S. NAZEEMA, M.E.,
ASSISTANT PROFESSOR
SUPERVISOR
Department of Computer Science and Engineering,
Muthayammal Engineering College (Autonomous), Rasipuram-637 408.
Submitted for the Project Work Phase-I Viva-Voce examination held on ___________
activities.
We here like to record our deep sense of gratitude to our beloved Principal
We extend our sincere thanks and gratitude to our Head of the Department
Science and Engineering for his efforts to complete our project successfully.
We are very much thankful to our Parents, Friends and all Faculty Members of the
Vision of the Institute
To be a Centre of excellence in Engineering, Technology and Management on par with
International standards
Mission of the Institute
• To prepare the students with high professional skills and ethical values
• To impart knowledge through best practices
• To instill spirit of innovation through training, research and development
• To undertake continuous assessment and remedial measures
• To achieve academic excellence through intellectual, emotional and social
stimulation
Program Outcomes (POs)
PO1 - Engineering knowledge: Apply the knowledge of mathematics, science,
engineering fundamentals, and an engineering specialization to the solution of complex
engineering problems.
PO2 - Problem analysis: Identify, formulate, review research literature, and analyze
complex engineering problems reaching substantiated conclusions using first principles
of mathematics, natural sciences, and engineering sciences.
PO3 - Design/development of solutions: Design solutions for complex engineering
problems and design system components or processes that meet the specified needs with
appropriate consideration for the public health and safety, and the cultural, societal, and
environmental considerations.
PO4 - Conduct investigations of complex problems: Use research-based knowledge
and research methods including design of experiments, analysis and interpretation of
data, and synthesis of the information to provide valid conclusions.
PO5 - Modern tool usage: Create, select, and apply appropriate techniques, resources,
and modern engineering and IT tools including prediction and modeling to complex
engineering activities with an understanding of the limitations.
PO6 - The engineer and society: Apply reasoning informed by the contextual
knowledge to assess societal, health, safety, legal and cultural issues and the consequent
responsibilities relevant to the professional engineering practice.
PO7 - Environment and sustainability: Understand the impact of the professional
engineering solutions in societal and environmental contexts, and demonstrate the
knowledge of, and need for sustainable development.
PO8 - Ethics: Apply ethical principles and commit to professional ethics and
responsibilities and norms of the engineering practice.
PO9 - Individual and team work: Function effectively as an individual, and as a
member or leader in diverse teams, and in multidisciplinary settings.
PO10 - Communication: Communicate effectively on complex engineering activities
with the engineering community and with society at large, such as, being able to
comprehend and write effective reports and design documentation, make effective
presentations, and give and receive clear instructions.
PO11 - Project management and finance: Demonstrate knowledge and understanding
of the engineering and management principles and apply these to one’s own work, as a
member and leader in a team, to manage projects and in multidisciplinary environments.
PO12 - Life-long learning: Recognize the need for, and have the preparation and ability
to engage in independent and life-long learning in the broadest context of technological
change.
Program Specific Outcomes (PSOs)
PSO1: Graduates should be able to design and analyze the algorithms to develop an
Intelligent Systems
PSO2: Graduates should be able to apply the acquired skills to provide efficient
solutions for real time problems
PSO3: Graduates should be able to exhibit an understanding of System Architecture,
Networking and Information Security.
COURSE OUTCOMES:
At the end of the course, the student will be able to
21CSP01.CO1 Understand the technical concepts of project area.
21CSP01.CO2 Identify the problem and formulation
21CSP01.CO3 Design the Problem Statement
21CSP01.CO4 Formulate the algorithm by using the design
21CSP01.CO5 Develop the Module
PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12 PSO1 PSO2 PSO3
✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔
INDEX
ABSTRACT
LIST OF FIGURES
1.2 OBJECTIVES
1.3 MACHINE LEARNING
1.4 ADVANTAGES
2 LITERATURE SURVEY
3 SYSTEM ANALYSIS
3.2.1 ADVANTAGES
4 SYSTEM REQUIREMENTS
4.1 HARDWARE REQUIREMENTS
5 PROJECT DESIGN
5.2 DATASET
5.3 PREPROCESSING
6.1 CONCLUSION
REFERENCES
ABSTRACT
LIST OF FIGURES
LIST OF ABBREVIATIONS
ABBREVIATION EXPANSION
AI Artificial Intelligence
CSV Comma Separated Values
DT Decision Tree
DL Deep Learning
DNN Deep Neural Network
EDA Exploratory Data Analysis
HTML Hyper Text Markup Language
HTTP Hyper Text Transfer Protocol
KNN K-nearest Neighbors Algorithm
LSTM Long Short-Term Memory
ML Machine Learning
PSO Particle Swarm Optimization
SSL Secure Sockets Layer
SVM Support Vector Machine
URL Uniform Resource Locator
VGA Video Graphics Array
CHAPTER 1
INTRODUCTION
1.2 OBJECTIVE
The objective of phishing website detection is to identify and classify
fraudulent websites that mimic legitimate ones to deceive users into providing
sensitive information. This is typically achieved using machine learning
techniques that analyze various features of URLs, content, and user behavior to
differentiate between legitimate and malicious sites. The goal is to enhance online
security by accurately predicting phishing attempts and reducing the risk of user
data compromise.
1.4 ADVANTAGES
• Automated Detection
• Adaptability
• High Accuracy
• Feature Extraction
• Efficient Optimization
CHAPTER 2
LITERATURE SURVEY
designed to steal sensitive user information like login credentials and financial
details. To address this, the project aims to improve the performance of machine
learning classifiers. By assigning appropriate weights to features, the project aims
to better distinguish between legitimate and phishing websites, thereby improving
classification accuracy.
In this approach, website attributes such as URL structure, HTML content,
and server behaviour are analyzed and treated as input features for a machine
learning classifier. PSO is used to identify the optimal weight for each feature by
minimizing a fitness function that reflects classification errors. The result is a
robust, intelligent system capable of real-time detection of phishing websites,
contributing significantly to cybersecurity by mitigating online fraud risks and
protecting users from digital threats.
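The weighting idea described above can be sketched in a few lines. The following is a minimal illustration only, not the implementation used in the surveyed work: the weighted 1-nearest-neighbour classifier, the leave-one-out error used as the fitness function, the PSO constants (inertia 0.7, acceleration coefficients 1.5) and the toy data are all assumptions chosen to keep the example self-contained.

```python
import random


def error_rate(weights, X, y):
    """Fitness: leave-one-out error of a feature-weighted 1-nearest-neighbour classifier."""
    errors = 0
    for i, (xi, yi) in enumerate(zip(X, y)):
        best_dist, best_label = float("inf"), None
        for j, (xj, yj) in enumerate(zip(X, y)):
            if i == j:
                continue
            # Weighted squared Euclidean distance between samples i and j
            dist = sum(w * (a - b) ** 2 for w, a, b in zip(weights, xi, xj))
            if dist < best_dist:
                best_dist, best_label = dist, yj
        errors += best_label != yi
    return errors / len(X)


def pso_feature_weights(X, y, n_particles=10, n_iters=20, seed=0):
    """Search for feature weights in [0, 1] that minimise error_rate."""
    rng = random.Random(seed)
    dim = len(X[0])
    pos = [[rng.random() for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [error_rate(p, X, y) for p in pos]
    g = min(range(n_particles), key=pbest_fit.__getitem__)
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    history = [gbest_fit]
    inertia, c1, c2 = 0.7, 1.5, 1.5  # assumed, commonly used settings
    for _ in range(n_iters):
        for k in range(n_particles):
            for d in range(dim):
                vel[k][d] = (inertia * vel[k][d]
                             + c1 * rng.random() * (pbest[k][d] - pos[k][d])
                             + c2 * rng.random() * (gbest[d] - pos[k][d]))
                # Keep every weight inside [0, 1]
                pos[k][d] = min(1.0, max(0.0, pos[k][d] + vel[k][d]))
            fit = error_rate(pos[k], X, y)
            if fit < pbest_fit[k]:
                pbest[k], pbest_fit[k] = pos[k][:], fit
                if fit < gbest_fit:
                    gbest, gbest_fit = pos[k][:], fit
        history.append(gbest_fit)
    return gbest, gbest_fit, history


# Toy data (hypothetical): feature 0 separates the classes, feature 1 is noise
toy_rng = random.Random(1)
y = [0, 1] * 15
X = [[label + toy_rng.gauss(0, 0.2), toy_rng.gauss(0, 3.0)] for label in y]

weights, err, history = pso_feature_weights(X, y)
print("weights:", weights, "leave-one-out error:", err)
```

In a real system the toy data would be replaced by the extracted website features, and the fitness function would evaluate whichever classifier is being tuned.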
predict whether a website is phishing or legitimate. The approach is efficient,
scalable, and adaptable, enabling the detection of newly emerging phishing sites
that traditional rule-based methods might miss.
CHAPTER 3
SYSTEM ANALYSIS
3.1.1 LIMITATIONS
• Phishing attacks are constantly evolving, making it difficult for machine
learning models to keep up.
• Attackers often use sophisticated methods, such as URL obfuscation and
social engineering, to evade detection.
• Phishing websites may share characteristics with legitimate sites,
complicating the feature extraction process.
• Phishing detection needs to occur in real-time to be effective, which can
strain computational resources.
• Machine learning models often rely on historical data for training, which
may not account for new phishing strategies.
3.2.1 ADVANTAGES
• Enhanced Detection Accuracy
• Real-time Mitigation
• Adaptability to Evolving Threats
• Reduction of False Positives
CHAPTER 4
SYSTEM REQUIREMENTS
4.3 SOFTWARE DESCRIPTION
• Python
• Python Features
Python
Python is a high-level, general-purpose programming language created by
Guido van Rossum and first released in 1991. It has since become one of the most
widely used programming languages because of its simplicity, readability, and
versatility. Its clear syntax makes it easy to learn and use, even for beginners,
and a large, active community contributes to its development and support. Python
is used for a wide range of tasks, including web development, data science,
machine learning, scientific computing, scripting, and automation. Part of its
popularity comes from its large and comprehensive standard library, which
includes modules for tasks such as file I/O, networking, and retrieving web pages.
Python is a dynamically typed, interpreted language, which means that source code
does not need a separate compilation step before it can be run. This shortens the
development cycle, since code can be written and run without waiting for it to
compile. Python also manages memory automatically, so the programmer does not
have to allocate and deallocate memory manually. Typical uses include:
• Developing web applications: Python is a popular choice for web development
due to its ease of use and powerful frameworks like Django and Flask.
• Analyzing data: Python is a popular choice for data science due to its
powerful libraries like NumPy and Pandas.
• Building machine learning models: Python is a popular choice for machine
learning due to its powerful libraries like scikit-learn and TensorFlow.
Figure 4.1 Working of Python Interpreter
transform data for machine learning models. Pandas simplifies tasks like loading
datasets from CSV or databases, extracting features such as URL length or special
characters, and cleaning data by handling duplicates or missing values. It supports
exploratory data analysis (EDA) to identify patterns and trends and prepares data
for machine learning by encoding labels or normalizing features. Post-prediction,
Pandas aids in result analysis, such as identifying misclassified URLs, and
integrates with visualization libraries like Matplotlib for reporting. Its seamless
integration with other Python tools and its scalability make Pandas essential for
creating robust phishing detection workflows.
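The cleaning and feature-extraction steps mentioned above might look as follows in Pandas. This is a minimal sketch under stated assumptions: the column names (url, label), the sample rows and the set of characters counted as "special" are hypothetical, and a real workflow would load the data with pd.read_csv instead of building it in memory.

```python
import pandas as pd

# Hypothetical sample rows; a real run would use pd.read_csv("<dataset>.csv")
raw = pd.DataFrame({
    "url": [
        "http://paypa1-login.example.com/verify?id=1&session=2",
        "https://www.example.org/",
        "https://www.example.org/",   # duplicate row
        None,                          # missing value
    ],
    "label": ["phishing", "legitimate", "legitimate", "phishing"],
})

# Cleaning: drop rows without a URL, then drop exact duplicates
clean = raw.dropna(subset=["url"]).drop_duplicates().reset_index(drop=True)

# Feature extraction: URL length and number of special characters
clean["url_length"] = clean["url"].str.len()
clean["special_chars"] = clean["url"].str.count(r"[@\-_?=&%]")

# Label encoding for the classifier: legitimate -> 0, phishing -> 1
clean["target"] = clean["label"].map({"legitimate": 0, "phishing": 1})

print(clean[["url_length", "special_chars", "target"]])
```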
Flask
Flask is a lightweight Python web framework that is well suited to building
phishing website detection systems. Its simplicity makes it straightforward to
integrate machine learning models, develop APIs, and provide interactive user
interfaces. A trained ML model can be deployed behind a RESTful API in which a
URL is submitted, preprocessed, and classified as phishing or legitimate. Such
APIs can be consumed by mobile or web applications, and Flask serves HTML
templates via Jinja2 for basic user interfaces. Feature extraction and
preprocessing, such as analyzing URL length or special characters, can run inside
the request handlers, which supports real-time phishing detection through fast
request processing. Flask applications can connect to databases like MySQL or
MongoDB to store results and log queries, and they fit well into
microservices-based architectures. Additionally, Flask's error handling and
feedback capabilities enhance user experience and system reliability.
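A prediction endpoint of the kind described above can be sketched as follows. The route name /predict, the JSON field url, the three features and the rule-based classify function are assumptions made for illustration; in a real deployment classify would call the trained model instead of a fixed rule.

```python
from urllib.parse import urlparse

from flask import Flask, jsonify, request

app = Flask(__name__)


def extract_features(url):
    """Compute a few simple URL features (illustrative subset)."""
    parsed = urlparse(url)
    return {
        "url_length": len(url),
        "uses_https": parsed.scheme == "https",
        "has_at_symbol": "@" in url,
    }


def classify(features):
    """Placeholder rule standing in for a trained model's predict() call."""
    suspicious = features["has_at_symbol"] or not features["uses_https"]
    return "phishing" if suspicious else "legitimate"


@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json(silent=True)
    url = payload.get("url") if isinstance(payload, dict) else None
    if not isinstance(url, str) or not url:
        # Error handling: reject requests without a usable URL
        return jsonify(error="JSON body must contain a non-empty 'url' string"), 400
    features = extract_features(url)
    return jsonify(url=url, features=features, prediction=classify(features))
```

The endpoint can be exercised without starting a server through Flask's test client, for example app.test_client().post("/predict", json={"url": "https://www.example.org/"}).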
CHAPTER 5
PROJECT DESIGN
5.2 DATASET
Upon inspection, it was observed that the Domain column was irrelevant
for training the machine learning model. After removing this column, the dataset
was refined to 16 features alongside a target column. The features from both
legitimate and phishing URL datasets were concatenated in the feature extraction
stage without shuffling, which could potentially introduce bias. To address this,
the data was shuffled to balance its distribution before splitting it into training and
testing sets. Shuffling the dataset helps ensure a fair representation of classes in
both subsets, preventing biases and reducing the risk of overfitting during model
training.
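The concatenate, shuffle and split sequence described above can be sketched as follows. The single feature column, the sample values, the random seed and the test size are assumptions for illustration; the actual dataset has 16 features. Stratified splitting is added here as one way of keeping the class proportions equal in both subsets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the legitimate and phishing feature tables
legit = pd.DataFrame({"url_length": [20, 25, 22, 30, 18, 27], "label": 0})
phish = pd.DataFrame({"url_length": [80, 95, 70, 88, 76, 91], "label": 1})

# Concatenation leaves all legitimate rows first, then all phishing rows
data = pd.concat([legit, phish], ignore_index=True)

# Shuffle the rows so that class order carries no information
shuffled = data.sample(frac=1, random_state=42).reset_index(drop=True)

X = shuffled.drop(columns="label")
y = shuffled["label"]

# Hold out 4 rows for testing; stratify keeps the class ratio in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=4, stratify=y, random_state=42
)
print(y_train.value_counts().to_dict(), y_test.value_counts().to_dict())
```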
5.3 PREPROCESSING
Preprocessing is a crucial step in phishing website detection as it involves
transforming raw data into a structured format that can be effectively analyzed by
machine learning models. The primary goal of preprocessing is to enhance the
accuracy and efficiency of detection systems by ensuring that the data used is
clean, relevant, and capable of revealing patterns that distinguish phishing
websites from legitimate ones. One of the key tasks in preprocessing is feature
extraction from URLs, which involves identifying important characteristics such
as the length of the URL, the presence of suspicious
keywords, and the structure of the domain name. Features like subdomains, the
use of HTTPS, and domain name patterns are also crucial indicators that need to
be extracted and analyzed. In addition to URL-based features, analyzing the
HTML content of a website is essential. The HTML structure can contain valuable
information, such as the use of specific meta tags, embedded links, and JavaScript,
all of which may point to the presence of phishing attempts.
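As an illustration of the URL-based features listed above, the following sketch uses only the Python standard library. The keyword list and the feature names are assumptions, and the subdomain count is deliberately naive: it treats the last two host labels as the registered domain, which is not correct for suffixes such as .co.uk.

```python
from urllib.parse import urlparse

# Illustrative keyword list (assumed, not taken from the project dataset)
SUSPICIOUS_KEYWORDS = ("login", "verify", "secure", "update", "account")


def url_features(url):
    """Extract a few simple features from a URL string."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    labels = host.split(".") if host else []
    lowered = url.lower()
    return {
        "url_length": len(url),
        "uses_https": parsed.scheme == "https",
        # Naive count: everything left of the last two host labels
        "num_subdomains": max(len(labels) - 2, 0),
        "has_hyphen_in_domain": "-" in host,
        "suspicious_keywords": sum(kw in lowered for kw in SUSPICIOUS_KEYWORDS),
    }


features = url_features("http://secure-login.bank.example.com/verify")
print(features)
```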
5.4 FEATURE EXTRACTION
Feature extraction plays a critical role in detecting phishing websites
and web spoofing attacks using machine learning. Key features extracted from
URLs include length, presence of suspicious keywords, domain patterns, and
subdomains, all of which can signal phishing attempts. Additionally, analyzing
HTML content helps identify malicious elements like iFrames, forms designed
to capture user credentials, and suspicious scripts. Server characteristics such as
domain age, IP address location, and SSL certificate validity are also important
indicators, as phishing sites often use newly registered domains or invalid SSL
certificates. Content-based features like text analysis help detect phishing, as these
sites often contain spelling errors or aggressive language to prompt user action.
Behavioural features, such as mouse movements or click patterns, and time spent
on the website, can provide additional insights into suspicious activity. Lastly,
contextual features like reputation data or referral information further assist in
identifying phishing websites.
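The HTML-based indicators mentioned above (iFrames, credential forms and scripts) can be counted with the standard-library HTML parser. This is a minimal sketch: the class name, the feature names and the sample page are hypothetical, and counting elements is only a starting point for deciding whether they are malicious.

```python
from html.parser import HTMLParser


class HtmlFeatureParser(HTMLParser):
    """Counts a few page elements that are often inspected in phishing detection."""

    def __init__(self):
        super().__init__()
        self.iframes = 0
        self.forms = 0
        self.password_inputs = 0
        self.external_scripts = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "iframe":
            self.iframes += 1
        elif tag == "form":
            self.forms += 1
        elif tag == "input" and (attrs.get("type") or "").lower() == "password":
            self.password_inputs += 1
        elif tag == "script" and attrs.get("src"):
            self.external_scripts += 1


def html_features(html):
    """Parse an HTML string and return the element counts as a dictionary."""
    parser = HtmlFeatureParser()
    parser.feed(html)
    parser.close()
    return {
        "iframes": parser.iframes,
        "forms": parser.forms,
        "password_inputs": parser.password_inputs,
        "external_scripts": parser.external_scripts,
    }


# Hypothetical page used only to exercise the parser
sample = """
<html><body>
<iframe src="http://tracker.example.net/frame" width="0" height="0"></iframe>
<form action="http://collector.example.net/post">
  <input type="text" name="user">
  <input type="password" name="pw">
</form>
<script src="http://cdn.example.net/x.js"></script>
</body></html>
"""
counts = html_features(sample)
print(counts)
```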
Such trees take time to build and can lead to overfitting: the tree gives
very good accuracy on the training dataset but poor accuracy on test data. This
problem can be tackled through hyperparameter tuning. The maximum depth of
the decision tree is set with the max_depth parameter; the larger the value of
max_depth, the more complex the tree becomes. The training error will of course
decrease as max_depth increases, but accuracy on the test data can then become
very poor. A value is therefore needed that neither overfits nor underfits the
data, and GridSearchCV can be used to find it.
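Tuning max_depth with GridSearchCV can be sketched as follows. The synthetic data from make_classification stands in for the phishing feature table, and the candidate depths, the 5-fold cross-validation and the random seeds are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 16-feature phishing dataset
X, y = make_classification(
    n_samples=400, n_features=16, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Candidate depths; None lets the tree grow until its leaves are pure
param_grid = {"max_depth": [2, 4, 6, 8, 10, None]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=5,                # 5-fold cross-validation on the training set only
    scoring="accuracy",
)
search.fit(X_train, y_train)

best_depth = search.best_params_["max_depth"]
test_accuracy = search.score(X_test, y_test)  # uses the refitted best tree
print("best max_depth:", best_depth, "test accuracy:", test_accuracy)
```

Because the depth is selected by cross-validation on the training split, the held-out test accuracy remains an honest estimate of how the chosen tree generalises.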
operate effectively without much parameter tuning, and do not require feature
scaling. Random forest is a supervised machine learning algorithm that is widely
used in classification and regression problems. It builds decision trees on
different samples of the data and takes their majority vote for classification
and their average in the case of regression.
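The voting behaviour described above can be illustrated with scikit-learn. This is a sketch on synthetic data, with the number of trees and the seeds chosen arbitrarily. Note that scikit-learn's RandomForestClassifier combines its trees by averaging their predicted class probabilities rather than by a hard vote, so the explicit majority vote below is computed by hand from the individual trees and may occasionally differ from forest.predict.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the phishing feature table (labels are 0 and 1)
X, y = make_classification(
    n_samples=400, n_features=16, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# An odd number of trees avoids ties in the two-class vote
forest = RandomForestClassifier(n_estimators=101, random_state=0)
forest.fit(X_train, y_train)

# One row of predictions per tree, each tree trained on a bootstrap sample
votes = np.array([tree.predict(X_test) for tree in forest.estimators_])

# Hard majority vote: class 1 wins when more than half of the trees predict it
majority = (votes.mean(axis=0) > 0.5).astype(int)

print("majority-vote accuracy:", (majority == y_test).mean())
print("forest.predict accuracy:", forest.score(X_test, y_test))
```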
So he decides to consult various people, such as his cousins, teachers, parents,
degree students, and working people. He asks them varied questions, such as why
he should choose a particular course, the job opportunities with that course, the
course fee, etc. Finally, after consulting these people, he decides to take the
course suggested by most of them. Ensemble uses two types of methods:
ADVANTAGES
CHAPTER 6
CONCLUSION AND FUTURE ENHANCEMENT
6.1 CONCLUSION
This survey reviewed various algorithms and approaches proposed by
researchers for detecting phishing websites using machine learning techniques.
The analysis of the literature revealed that most researchers utilized well-known
machine learning algorithms such as Naïve Bayes, Support Vector Machines
(SVM), Decision Trees, and Random Forests to build their detection models.
These algorithms were chosen for their proven reliability and effectiveness in
handling classification problems, especially in identifying phishing websites. The
survey also summarized experimentally successful techniques in identifying
phishing URLs. As phishing attacks continue to grow in complexity and
frequency, the survey emphasized the need to continuously update detection
systems by incorporating new features or replacing outdated ones. This
adaptability ensures that machine learning models remain effective in combating
the evolving nature of phishing threats. By exploring and integrating these
advancements, the field of phishing detection can further enhance its resilience
against sophisticated cyberattacks.
REFERENCES
10. J. Mao, W. Tian, P. Li, T. Wei, and Z. Liang, “Phishing-alarm: Robust and
efficient phishing detection via page component similarity,” IEEE Access,
vol. 5, pp. 17020–17030, 2017.
11. P. Rao, J. Gyani, and G. Narsimha, “Fake profiles identification in online
social networks using machine learning and NLP,” Int. J. Appl. Eng. Res.,
vol. 13, no. 6, pp. 973–4562, 2018.
12. D. Sahoo, C. Liu, and S. C. H. Hoi, “Malicious URL detection using
machine learning: A survey,” 2017, arXiv:1701.07179.
13. M. Sanchez-Paniagua, E. F. Fernandez, E. Alegre, W. Al-Nabki, and V.
Gonzalez-Castro, “Phishing URL detection: A real-case scenario through
login URLs,” IEEE Access, vol. 10, pp. 42949–42960, 2022.
14. S. Malebary and W. Ali, “Particle swarm optimization-based feature
weighting for improving intelligent phishing website detection,” 2020.
15. K. Yu, L. Tan, S. Mumtaz, S. Al-Rubaye, A. Al-Dulaimi, A. K. Bashir, and
A. Khan, “Securing critical infrastructures: Deep-learning-based threat
detection in IIoT,” IEEE Commun. Mag., vol. 59, no. 10, pp. 76–82, Oct.
2021.