Phishing Phase1 Report
Submitted by
NARESH R (312420205062)
RAJASEKARAN B (312420205074)
BACHELOR OF TECHNOLOGY
in
INFORMATION TECHNOLOGY
ANNA UNIVERSITY: CHENNAI 600 025
BONAFIDE CERTIFICATE
SIGNATURE
Dr. S. KALARANI M.E., Ph.D.,
Professor
HEAD OF THE DEPARTMENT
Department of Information Technology
St. Joseph’s Institute of Technology
Old Mamallapuram Road
Chennai-600119

SIGNATURE
Ms. S. Anslam Sibi M.E., (Ph.D).,
Assistant Professor
SUPERVISOR
Department of Information Technology
St. Joseph’s Institute of Technology
Old Mamallapuram Road
Chennai-600119
(312420205074) MACHINE
LEARNING”
The report of the project work submitted by the above students for the Project in
Information Technology of Anna University was evaluated and confirmed to be a
report of the work done by the above students.
ABSTRACT:
The rapid growth of internet services has been accompanied by a range of malicious
attempts to trick individuals into performing undesired actions. Attackers continually
devise new techniques over the Internet, such as phishing.
Using false websites, attackers collect sensitive information such as user data, login
credentials, social security numbers, and banking information. Recognizing whether a
website is legitimate or phishing is a difficult problem.
This paper proposes a phishing website analyzer based on machine learning. The model
predicts whether a website is legitimate or phishing, using different classification
algorithms and natural language processing (NLP) based features.
LIST OF FIGURES
TABLE OF CONTENTS
ABSTRACT
LIST OF FIGURES
1 INTRODUCTION
  System Overview
2 LITERATURE SURVEY
3 SYSTEM ANALYSIS
  Existing System
  Proposed System
  Requirement Specification
  Software Requirement
  Hardware Requirements
4 SYSTEM DESIGN
  Architecture Diagram
  Activity diagram
  Sequence diagram
  Component diagram
5 SYSTEM IMPLEMENTATION
CHAPTER 1
INTRODUCTION
1.1 SYSTEM OVERVIEW
The term ‘phishing’ is derived from ‘fishing’ for victims. The attackers, known as
phishers, lure users by creating fraudulent websites whose design closely resembles that
of popular, legitimate sites on the internet.
The main focus of this paper is the real-time detection of phishing web pages by
analyzing the URL of the web page with different machine learning algorithms.
To this end, we first collect a large number of legitimate and fraudulent web page URLs
from the dataset and extract Natural Language Processing (NLP) based features from them.
Machine learning algorithms (Logistic Regression, Support Vector Machine, Naive Bayes,
Random Forest, and K-Nearest Neighbor) are then implemented to measure the efficiency of
the proposed system.
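As an illustration of what URL-based, NLP-style features might look like, the following is a minimal sketch using only the Python standard library. The report does not list its exact features, so the feature names, the keyword list SUSPICIOUS_WORDS, and the function names are assumptions made for this example.

```python
import re
from urllib.parse import urlparse

# Hypothetical keyword list for illustration; not taken from the report's dataset.
SUSPICIOUS_WORDS = {"login", "verify", "secure", "account", "update", "bank"}


def tokenize_url(url):
    """Split a URL into lowercase word-like tokens on non-alphanumeric characters."""
    return [tok for tok in re.split(r"[^a-z0-9]+", url.lower()) if tok]


def extract_features(url):
    """Return a small dictionary of lexical features for one URL (assumed feature set)."""
    # urlparse only finds the host when a scheme is present, so add one if missing.
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = parsed.hostname or ""
    tokens = tokenize_url(url)
    return {
        "url_length": len(url),
        "num_dots": host.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "has_ip_host": bool(re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}", host)),
        "has_at_symbol": "@" in url,
        "num_suspicious_words": sum(tok in SUSPICIOUS_WORDS for tok in tokens),
    }
```

Each URL is thereby turned into a fixed set of numeric or boolean values that any of the classifiers named above could consume. A real feature set would likely be larger and chosen by experiment.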
The aim of the "Phishing Website Analyzer" project is to develop a system that
effectively detects and prevents phishing websites using Natural Language
Processing (NLP) and machine learning. The project focuses on enhancing online
security by addressing the limitations of existing systems or the lack of a
systematic approach to phishing detection. The primary goal is to achieve accurate
phishing URL detection, reducing false positives and false negatives. The system
will provide real-time or on-demand analysis of websites, enabling timely threat
detection and response. Continuous adaptation to evolving phishing techniques
through updates and model retraining is a key objective. A user-friendly interface
will simplify URL analysis, while ethical considerations will ensure user privacy
and prevent misuse. Additionally, user education efforts will empower users to
recognize and protect themselves against phishing threats. The project's ultimate
aim is to contribute to a safer online environment by prioritizing user security and
privacy.
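The goal of reducing false positives and false negatives can be made measurable. The sketch below shows one way the rates might be computed from true and predicted labels. The label strings "phishing" and "legitimate" and the function names are assumptions for this example, and the report does not state which metrics it will use.

```python
def confusion_counts(y_true, y_pred, positive="phishing"):
    """Count true/false positives and negatives, treating `positive` as the alarm class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1  # legitimate site wrongly flagged
        elif truth == positive:
            fn += 1  # phishing site missed
        else:
            tn += 1
    return tp, fp, tn, fn


def detection_metrics(y_true, y_pred, positive="phishing"):
    """Return accuracy, precision, recall, and the two error rates as a dictionary."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        "false_negative_rate": fn / (fn + tp) if (fn + tp) else 0.0,
    }
```

Reporting the two error rates alongside accuracy matters because a phishing dataset can be imbalanced, in which case accuracy alone may hide a high miss rate.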
CHAPTER 2
LITERATURE SURVEY
using attributes from the URL. Both the Kaggle and Phishtank websites make it easy
to get the dataset used in this study. The researchers used a hybrid approach that
combined Principal Component Analysis (PCA) with Support Vector Machine (SVM) and
Random Forest algorithms to reduce the dataset's dimensionality while keeping all
important data, and it produced a higher accuracy rate of 96.8% compared to other
techniques investigated.
phishing websites from the UCI repository and used various machine learning
techniques, including decision trees, AdaBoost, support vector machines (SVM), and
random forests, to analyze selected features (such as web traffic, port, URL length, IP
address, and URL_of_Anchor). The most effective model for detecting phishing
websites was chosen, and two priority-based algorithms (PA1 and PA2) were proposed.
The team utilized a new fusion classifier in conjunction with these algorithms and
attained an accuracy rate of 97% when compared to previous works in phishing website
detection.
L. Tang, Q. Mahmoud: The approach proposed in this study uses URLs collected from
a variety of platforms, including Kaggle, PhishStorm, PhishTank, and ISCX-UR, to
identify phishing websites. A major contribution of the researchers is a browser
plug-in that can quickly recognize phishing risks and offer warnings. Various datasets
and machine learning techniques were investigated, and the proposed RNN-GRU model
outperformed SVM, Random Forest (RF), and Logistic Regression with a maximum
accuracy rate of 99.18%. However, the suggested method is not always accurate in
identifying whether short links are phishing risks.
websites based on URLs from the University of California, Irvine Machine Learning
Repository. Four classifiers were used: SVM, decision tree, Naive Bayesian, and neural
network. The outcome of experiments utilizing the model developed with the support
of a training set of data demonstrates that the classifiers were able to successfully
differentiate authentic websites from fake ones with an accuracy rate of over 90%.
Limitations include a small dataset and all features being discrete, which may not be suitable
Tyagi; J. Shad; S. Sharma; S. Gaur; Gagandeep Kaur: This research focuses on the
use of various machine learning algorithms to identify whether a website is legitimate
or a phishing site based on its URL. The study's most important contribution is the
creation of the Generalized Linear Model (GLM), a brand-new model that combines the
results of two different methods. With a 98.4% accuracy rate, the Random Forest and
GLM combination produced the best results for detecting phishing websites.
CHAPTER 3
SYSTEM ANALYSIS
3.1 EXISTING SYSTEM
This section describes the limitations and drawbacks of the current state of phishing
website detection, in particular the absence of a systematic approach:
Lack of Accuracy: Without a proper system, there is a higher risk of users encountering
phishing websites without realizing it. This leads to a lack of accuracy in identifying and
blocking malicious sites, potentially resulting in users falling victim to phishing attacks.
User Vulnerability: Users are left vulnerable to phishing attacks due to the absence of an
effective detection system. This can result in financial losses, data breaches, and
compromised personal information.
Challenges: The existing system, which is effectively a lack of a system, presents various
challenges in ensuring the safety and security of online activities. Users have to rely on
their own judgment and awareness to identify potential threats.
3.1.1 DISADVANTAGES:
Inability to Adapt: The current system, essentially the lack of one, struggles with
adapting to evolving phishing techniques. Traditional methods for detecting phishing
websites may not be equipped to handle the ever-evolving sophistication of modern
attacks.
User Vulnerabilities: Users are currently vulnerable to phishing attacks due to the
absence of an effective detection system. This puts their financial security and personal
data at risk.
Reliance on User Judgment: Without a dedicated system, users have to rely on their own
judgment and awareness to identify potential threats, which is not foolproof.
3.2 PROPOSED SYSTEM
This section introduces the proposed phishing website analyzer, highlighting its key
features and benefits:
Novel Approach: The proposed system represents a novel approach to phishing website
detection. It addresses the limitations of the existing system by providing a structured
method for identifying and blocking phishing websites.
Accurate Detection: The core advantage of the proposed system is its ability to
significantly enhance phishing URL detection accuracy. By leveraging machine learning
and NLP techniques, it can distinguish between legitimate and phishing websites with a
high level of precision.
Real-time Analysis: The system allows for real-time or on-demand analysis of websites.
Users can input URLs for immediate analysis, which is crucial for timely threat detection
and response.
3.2.1 ADVANTAGES
Continuous Updates: To stay effective, the system is designed to continuously update its
dataset and retrain the machine learning model. This ensures that it remains current and
can identify the latest phishing techniques.
Ethical Considerations: Ethical considerations are integral to the system's design. User
privacy and consent are respected, and measures are in place to prevent misuse of the
system by malicious actors.
Costs and Resources: The development and implementation of the proposed system
come with associated costs, such as hardware, software, and manpower. However, these
costs are justified by the system's ability to enhance online security.
User Education: While the system offers advanced protection, it is important to note that
user education remains a key component of online security. The system complements user
awareness efforts but does not replace them.
3.3 REQUIREMENT SPECIFICATION
3.3.1 SOFTWARE REQUIREMENT
Operating System: The project should specify the supported operating systems for
running the software. Common choices include Windows, macOS, and Linux.
Machine Learning Libraries: Specify the machine learning libraries that will be used for
implementing the detection algorithms. Common libraries include scikit-learn,
TensorFlow, and PyTorch.
NLP Libraries: Mention the natural language processing libraries required for analyzing
textual content. Popular choices include NLTK, spaCy, and Gensim.
Web Scraping Tools: Identify the tools or libraries needed to collect website data for the
dataset. Tools like Beautiful Soup or Scrapy are often used.
Database Management System: Specify the database management system for storing
and managing datasets. Options include MySQL, PostgreSQL, MongoDB, or SQLite.
User Interface Development Tools: Detail the tools, frameworks, or libraries used to
create a user-friendly front-end for users to interact with the system. Common choices
include Flask, Django, or JavaScript frameworks like React or Angular.
Security Tools: Consider including security tools or libraries for securing the system
against potential attacks and ensuring data privacy.
Web Hosting: If the system includes a web-based component, specify the web hosting
service or server requirements.
3.3.2 HARDWARE REQUIREMENTS
Computational Resources: Ensure that you have sufficient computational resources for
training machine learning models. This may involve powerful CPUs and GPUs.
Storage Capacity: Allocate enough storage capacity for maintaining datasets, model
checkpoints, and other relevant data. SSDs are often preferred for faster data access.
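To make the database requirement listed above more concrete, the following is a minimal sketch of how labeled URLs might be stored, assuming SQLite is the option chosen (the report lists it only as one possibility). The table name urls, its columns, and the helper functions are hypothetical.

```python
import sqlite3


def create_store(path=":memory:"):
    """Open (or create) a SQLite database holding labeled URLs."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS urls (
               url    TEXT PRIMARY KEY,   -- one row per distinct URL
               label  TEXT NOT NULL,      -- e.g. 'legitimate' or 'phishing'
               source TEXT                -- where the URL was collected from
           )"""
    )
    return conn


def add_url(conn, url, label, source=None):
    """Insert a labeled URL; return True if stored, False if it was a duplicate."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO urls (url, label, source) VALUES (?, ?, ?)",
            (url, label, source),
        )
    return cur.rowcount == 1


def count_by_label(conn):
    """Return a {label: count} dictionary, useful for checking class balance."""
    rows = conn.execute("SELECT label, COUNT(*) FROM urls GROUP BY label")
    return dict(rows.fetchall())
```

Using the URL as the primary key lets the database reject duplicates at insertion time, which complements the duplicate removal done during preprocessing.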
CHAPTER 4
SYSTEM DESIGN
4.2 ACTIVITY DIAGRAM
This activity diagram illustrates the core workflow of phishing website detection using
Natural language processing.
Fig 4.3 Activity diagram
4.3 SEQUENCE DIAGRAM
In the sequence diagram for our phishing URL detection, we can visualize the interaction
between actors.
CHAPTER 5
SYSTEM IMPLEMENTATION
5.1 MODULES
Data Collection Module: Responsible for gathering a dataset of URLs that includes both
legitimate and phishing sites.
Data Preprocessing Module: Cleans and preprocesses the URL data by removing duplicates
and special characters and normalizing the data.
Machine Learning Module: Implements the machine learning algorithms for URL
classification, such as Logistic Regression, Naive Bayes, or Random Forest.
User Interface Module: Develops the user interface for users to interact with the system,
allowing them to input URLs for analysis and displaying results.
Continuous Update Module: Handles regular updates and retraining of the machine learning
model to keep the system effective against evolving phishing techniques.
Security and Ethics Module: Implements security measures to protect the system and user
data and ensures ethical considerations and privacy measures are in place.
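The preprocessing and classification steps described above can be sketched end to end as follows. This is a simplified illustration. The normalization rules are assumptions, and the small from-scratch Naive Bayes class stands in for a library implementation such as the one in scikit-learn. All function and class names are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def preprocess(urls_with_labels):
    """Normalize URLs (trim, lowercase, drop trailing slashes) and remove duplicates.

    Lowercasing the whole URL is a simplification: paths can be case-sensitive.
    """
    seen, cleaned = set(), []
    for url, label in urls_with_labels:
        norm = url.strip().lower().rstrip("/")
        if norm and norm not in seen:
            seen.add(norm)
            cleaned.append((norm, label))
    return cleaned


def tokenize(url):
    """Split on special characters so only alphanumeric tokens remain."""
    return [tok for tok in re.split(r"[^a-z0-9]+", url.lower()) if tok]


class NaiveBayesURLClassifier:
    """Multinomial Naive Bayes over URL tokens with add-one (Laplace) smoothing."""

    def fit(self, samples):
        self.class_counts = Counter()
        self.token_counts = defaultdict(Counter)
        self.vocab = set()
        for url, label in samples:
            self.class_counts[label] += 1
            for tok in tokenize(url):
                self.token_counts[label][tok] += 1
                self.vocab.add(tok)
        return self

    def predict(self, url):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods of each token.
            score = math.log(count / total)
            denom = sum(self.token_counts[label].values()) + len(self.vocab)
            for tok in tokenize(url):
                score += math.log((self.token_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

A model trained with fit on the output of preprocess can then label a new URL with predict. The same cleaned data could equally be passed to Logistic Regression or Random Forest, and retraining amounts to calling fit again on an updated dataset.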
CHAPTER 6
CONCLUSION AND FUTURE ENHANCEMENT
6.1 CONCLUSION
The system's modules, from data collection and preprocessing to machine learning, user
interface, continuous updates, and security and ethics, form a cohesive framework that
ensures efficient phishing detection and user protection.
6.2 FUTURE ENHANCEMENT
The phishing website analyzer project can be further improved and expanded in several
ways:
Advanced Machine Learning Models: Explore more advanced machine learning models
and deep learning techniques for even higher accuracy in phishing detection.
Real-time Alerts: Develop a real-time alerting system that can notify users when they
visit a potentially phishing website, enhancing proactive protection.
Integration with Browsers: Create browser extensions or plugins that integrate directly
with popular web browsers to provide seamless protection.
Mobile Application: Develop a mobile application for on-the-go URL analysis and
protection.
Collaboration with ISPs: Collaborate with internet service providers (ISPs) to implement
phishing website detection at the network level, preventing users from accessing harmful
websites.