
Phishing Phase1 Report

This document presents a literature review of prior research on phishing website detection using machine learning techniques. Several studies are summarized that used datasets of phishing and legitimate URLs to evaluate classifiers including support vector machines, random forests, neural networks, decision trees and Naive Bayes. Accuracies of over 90% were typically achieved, with one study obtaining 99.18% using an RNN-GRU model. Features examined included attributes from the URL, web traffic data, port numbers and IP addresses. Some limitations around small datasets and discrete features are also noted.

Uploaded by

5082 SAKTHIVEL
Copyright: © All Rights Reserved

PHISHING WEBSITE ANALYZER

USING MACHINE LEARNING


IT8811 - PROJECT WORK
PHASE 1 - REPORT

Submitted by

NARESH R (312420205062)

RAJASEKARAN B (312420205074)

BACHELOR OF TECHNOLOGY
in
INFORMATION TECHNOLOGY

St. JOSEPH’S INSTITUTE OF TECHNOLOGY, CHENNAI- 600 119

(An Autonomous Institution)

ANNA UNIVERSITY,CHENNAI 600025


OCTOBER 2023

ANNA UNIVERSITY: CHENNAI 600 025

BONAFIDE CERTIFICATE

Certified that this project report “PHISHING WEBSITE ANALYZER USING MACHINE
LEARNING” is the bonafide work of Naresh R (312420205062) and Rajasekaran B
(312420205074) who carried out the Mini project work under my supervision.

SIGNATURE                                 SIGNATURE
Dr. S.KALARANI M.E., Ph.D.,               Ms. S. Anslam Sibi M.E.,(Ph.D).,
Professor                                 Assistant Professor
HEAD OF THE DEPARTMENT                    SUPERVISOR
Department Of Information Technology      Department Of Information Technology
St.Joseph’s Institute of Technology       St.Joseph’s Institute of Technology
Old Mamallapuram Road                     Old Mamallapuram Road
Chennai-600119                            Chennai-600119

Submitted for the Viva-Voce held on

(INTERNAL EXAMINER) (EXTERNAL EXAMINER)


CERTIFICATE OF EVALUATION

College Name : St. Joseph’s Institute of Technology

Branch & Semester : Information Technology (VII)

S.NO   NAMES OF STUDENTS              TITLE OF THE PROJECT           NAME OF THE SUPERVISOR WITH DESIGNATION
1.     Naresh R (312420205062)        “PHISHING WEBSITE ANALYZER     Dr. L. Javid Ali
2.     Rajasekaran B (312420205074)   USING MACHINE LEARNING”

The reports of the project work submitted by the above students for the Project in
Information Technology of Anna University were evaluated and confirmed to be
reports of the work done by the above students.

(INTERNAL EXAMINER) (EXTERNAL EXAMINER)

ABSTRACT:

The rapid growth of internet services has been accompanied by a range of malicious
attempts to trick individuals into performing undesired actions. Attackers keep
devising new techniques on the Internet, such as phishing.
Using false websites, attackers collect sensitive information such as user data,
login credentials, social security numbers, and banking information. Recognizing
whether a website is legitimate or phishing is a difficult problem.
In this paper, a phishing website analyzer using machine learning is proposed. The
model predicts whether a website is legitimate or not, using different
classification algorithms and natural language processing (NLP) based features.

LIST OF FIGURES

FIG NO NAME OF THE FIGURE PAGE

4.1 ARCHITECTURE DIAGRAM

4.2 USE CASE DIAGRAM

4.3 ACTIVITY DIAGRAM

4.4 SEQUENCE DIAGRAM

4.5 COMPONENT DIAGRAM

TABLE OF CONTENTS

CHAPTER   TITLE

          ABSTRACT
          LIST OF FIGURES

1         INTRODUCTION
          System Overview
          Scope of the project

2         LITERATURE SURVEY

3         SYSTEM ANALYSIS
          Existing System
          Proposed System
          Advantages of the Proposed System
          Disadvantages of the Proposed System
          Requirement Specification
          Software Requirement
          Hardware Requirements

4         SYSTEM DESIGN
          Architecture Diagram
          Use case diagram
          Activity diagram
          Sequence diagram
          Component diagram

5         SYSTEM IMPLEMENTATION
          Data Collection Module
          Data Preprocessing Module
          Machine Learning Model Module
          User Interface Module
          Continuous Update Module
          Security and Ethical Module

6         CONCLUSION AND FUTURE ENHANCEMENTS

CHAPTER 1

INTRODUCTION
1.1 SYSTEM OVERVIEW
The term ‘phishing’ is derived from ‘fishing’ for victims. Attackers, called phishers,
lure users by creating fraudulent websites whose design closely resembles that of
popular, legitimate sites on the internet.
The main focus of this paper is real-time detection of phishing web pages by examining
the URL of the web page with different machine learning algorithms.
We first collect a large number of legitimate and fraudulent web page URLs from the
dataset and extract Natural Language Processing (NLP) based features from them. The
machine learning algorithms Logistic Regression, Support Vector Machine, Naive Bayes,
Random Forest, and K-Nearest Neighbor are then implemented to measure the efficiency
of the proposed system.
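
As an illustration of the NLP-based URL features mentioned above, the following is a
minimal Python sketch that tokenizes a URL and derives a few lexical features. The
feature names, the SUSPICIOUS_WORDS list, and the functions tokenize_url and
extract_features are assumptions made for this example; the report does not specify
the exact feature set.

```python
import re
from urllib.parse import urlparse

# Hypothetical keyword list for illustration; not taken from the report.
SUSPICIOUS_WORDS = {"login", "verify", "secure", "account", "update", "bank"}


def tokenize_url(url):
    """Split a URL into lowercase alphanumeric tokens."""
    return [tok for tok in re.split(r"[^a-z0-9]+", url.lower()) if tok]


def extract_features(url):
    """Return a dictionary of simple lexical features for one URL."""
    # Assume http when no scheme is given so that urlparse finds the host.
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = parsed.hostname or ""
    tokens = tokenize_url(url)
    return {
        "url_length": len(url),
        "num_dots_in_host": host.count("."),
        "num_hyphens": url.count("-"),
        "num_digits": sum(ch.isdigit() for ch in url),
        # A raw IPv4 address in place of a domain name is a common warning sign.
        "has_ip_host": int(bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host))),
        "has_at_symbol": int("@" in url),
        "uses_https": int(parsed.scheme == "https"),
        "num_tokens": len(tokens),
        "num_suspicious_words": sum(tok in SUSPICIOUS_WORDS for tok in tokens),
    }


if __name__ == "__main__":
    print(extract_features("http://192.168.0.1/secure-login/verify.php"))
```

Each URL becomes a fixed-length numeric record, which is the form that classifiers
such as Logistic Regression or Random Forest expect as input.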

1.2 AIM OF THE PROJECT

The aim of the "Phishing Website Analyzer" project is to develop a system that
effectively detects and prevents phishing websites using Natural Language
Processing (NLP) and machine learning. The project focuses on enhancing online
security by addressing the limitations of existing systems or the lack of a
systematic approach to phishing detection. The primary goal is to achieve accurate
phishing URL detection, reducing false positives and false negatives. The system
will provide real-time or on-demand analysis of websites, enabling timely threat
detection and response. Continuous adaptation to evolving phishing techniques
through updates and model retraining is a key objective. A user-friendly interface
will simplify URL analysis, while ethical considerations will ensure user privacy
and prevent misuse. Additionally, user education efforts will empower users to
recognize and protect themselves against phishing threats. The project's ultimate
aim is to contribute to a safer online environment by prioritizing user security and
privacy.
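
The goal of reducing false positives and false negatives can be made concrete with
standard confusion-matrix metrics. The sketch below is only illustrative: it assumes
the label 1 means phishing and 0 means legitimate, and the function names
confusion_counts and detection_metrics are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count outcomes, treating 1 as phishing and 0 as legitimate."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == 1 and truth == 1:
            tp += 1  # phishing site correctly flagged
        elif pred == 1 and truth == 0:
            fp += 1  # legitimate site wrongly flagged (false positive)
        elif pred == 0 and truth == 0:
            tn += 1  # legitimate site correctly passed
        else:
            fn += 1  # phishing site missed (false negative)
    return tp, fp, tn, fn


def detection_metrics(y_true, y_pred):
    """Return accuracy, precision, recall, and the two error rates."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)

    def ratio(num, den):
        return num / den if den else 0.0

    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "false_positive_rate": ratio(fp, fp + tn),
        "false_negative_rate": ratio(fn, fn + tp),
    }


if __name__ == "__main__":
    truth = [1, 1, 1, 0, 0, 0, 0, 1]
    predicted = [1, 1, 0, 0, 0, 1, 0, 1]
    print(detection_metrics(truth, predicted))
```

Reporting the false positive rate and false negative rate alongside accuracy shows
which kind of error a model makes, which accuracy alone hides.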

CHAPTER 2

LITERATURE SURVEY

N. Choudhary b, K. Jain, S. Jain: This study emphasizes the significance of using only
attributes from the URL. The dataset used in this study is readily available from both
the Kaggle and Phishtank websites. The researchers used a hybrid approach that combined
Principal Component Analysis (PCA) with Support Vector Machine (SVM) and Random Forest
algorithms to reduce the dataset's dimensionality while keeping all important data, and
it produced a higher accuracy rate of 96.8% compared to other techniques investigated.

A. Lakshmanarao, P. Surya, M Bala Krishna: This thesis collected a dataset of phishing
websites from the UCI repository and used various machine learning techniques, including
decision trees, AdaBoost, support vector machines (SVM), and random forests, to analyze
selected features (such as web traffic, port, URL length, IP address, and
URL_of_Anchor). The most effective model for detecting phishing websites was chosen, and
two priority-based algorithms (PA1 and PA2) were proposed. The team utilized a new
fusion classifier in conjunction with these algorithms and attained an accuracy rate of
97% when compared to previous works in phishing website detection.

L. Tang, Q. Mahmoud: The approach proposed in this study uses URLs collected from a
variety of platforms, including Kaggle, Phish Storm, Phish Tank, and ISCX-UR, to
identify phishing websites. A major contribution of the researchers is a browser plug-in
that can quickly recognize phishing risks and offer warnings. Various datasets and
machine learning techniques were investigated, and the proposed RNN-GRU model
outperformed SVM, Random Forest (RF), and Logistic Regression with a maximum accuracy
rate of 99.18%. On the other hand, the suggested method is not always accurate in
identifying whether short links are phishing risks.

A. Kulkarni & L. Brown: A machine learning system was created to categorize websites
based on URLs from the University of California, Irvine Machine Learning Repository.
Four classifiers were used: SVM, decision tree, Naive Bayesian, and neural network. The
outcome of experiments utilizing the model developed with the support of a training set
of data demonstrates that the classifiers were able to successfully differentiate
authentic websites from fake ones with an accuracy rate of over 90%. Limitations include
a small dataset and all features being discrete, which may not be suitable for some
classifiers.

Tyagi; J. Shad; S. Sharma; S. Gaur; Gagandeep Kaur: This research focuses on the use of
various machine learning algorithms to identify whether a website is legitimate or a
phishing site based on its URL. The study's most important contribution is the creation
of the Generalized Linear Model (GLM), a brand-new model. This model combines the
results of two different methods. With a 98.4% accuracy rate, the Random Forest and GLM
combination produced the best results for detecting phishing websites.

CHAPTER 3

SYSTEM ANALYSIS

3.1 EXISTING SYSTEM

This section describes the limitations and drawbacks of the current state of phishing
website detection, in particular the absence of a systematic approach:

Absence of a System: Currently, there is no dedicated system or method in place for
phishing website detection. This means that users are not provided with any protection or
warnings when visiting potentially harmful websites.

Lack of Accuracy: Without a proper system, there is a higher risk of users encountering
phishing websites without realizing it. This leads to a lack of accuracy in identifying and
blocking malicious sites, potentially resulting in users falling victim to phishing attacks.

Inability to Adapt: In the absence of a dedicated system, there is no mechanism for
adapting to evolving phishing techniques. Traditional methods for detecting phishing
websites may be outdated and unable to keep up with the sophistication of modern attacks.

User Vulnerability: Users are left vulnerable to phishing attacks due to the absence of an
effective detection system. This can result in financial losses, data breaches, and
compromised personal information.

Challenges: The existing system, which is effectively a lack of a system, presents various
challenges in ensuring the safety and security of online activities. Users have to rely on
their own judgment and awareness to identify potential threats.

3.1.1 DISADVANTAGES:

• Lack of Systematic Approach: Due to the absence of a dedicated system, there is no
systematic approach to identify and block phishing websites. Users are left vulnerable to
potentially harmful sites.
• Accuracy Issues: The absence of a formal system leads to accuracy issues. Users might
encounter phishing websites without proper detection, leading to potential financial losses,
data breaches, and compromised personal information.
• Inability to Adapt: The current system, essentially the lack of one, struggles with
adapting to evolving phishing techniques. Traditional methods for detecting phishing
websites may not be equipped to handle the ever-evolving sophistication of modern
attacks.
• User Vulnerabilities: Users are currently vulnerable to phishing attacks due to the
absence of an effective detection system. This puts their financial security and personal
data at risk.
• Reliance on User Judgment: Without a dedicated system, users have to rely on their own
judgment and awareness to identify potential threats, which is not foolproof.
3.2 PROPOSED SYSTEM

This section introduces the proposed phishing website analyzer, highlighting its key
features and benefits:

Novel Approach: The proposed system represents a novel approach to phishing website
detection. It addresses the limitations of the existing system by providing a structured
method for identifying and blocking phishing websites.

Accurate Detection: The core advantage of the proposed system is its ability to
significantly enhance phishing URL detection accuracy. By leveraging machine learning
and NLP techniques, it can distinguish between legitimate and phishing websites with a
high level of precision.

Real-time Analysis: The system allows for real-time or on-demand analysis of websites.
Users can input URLs for immediate analysis, which is crucial for timely threat detection
and response.
3.2.1 ADVANTAGES

Enhanced Security Measures: The proposed system incorporates advanced security
measures to protect against evolving phishing tactics. It can adapt to new threats by
regularly updating its database and retraining the machine learning model.

User-friendly Interface: A user-friendly interface has been designed to ensure that users
can easily interact with the system. This includes a straightforward process for submitting
URLs for analysis, with clear and intuitive feedback on the potential threat level.

Continuous Updates: To stay effective, the system is designed to continuously update its
dataset and retrain the machine learning model. This ensures that it remains current and
can identify the latest phishing techniques.

Ethical Considerations: Ethical considerations are integral to the system's design. User
privacy and consent are respected, and measures are in place to prevent misuse of the
system by malicious actors.

Costs and Resources: The development and implementation of the proposed system
come with associated costs, such as hardware, software, and manpower. However, these
costs are justified by the system's ability to enhance online security.

User Education: While the system offers advanced protection, it is important to note that
user education remains a key component of online security. The system complements user
awareness efforts but does not replace them.

3.3 REQUIREMENT SPECIFICATION


The requirements for a phishing website analyzer project involve both software and hardware
elements, as well as other considerations such as data and ethical requirements. Here's a detailed
breakdown of these requirements:

3.3.1 Software Requirements:

Operating System: The project should specify the supported operating systems for
running the software. Common choices include Windows, macOS, and Linux.

Programming Language: Define the programming language for system development.
Python is often used for machine learning and NLP applications.

Machine Learning Libraries: Specify the machine learning libraries that will be used for
implementing the detection algorithms. Common libraries include scikit-learn,
TensorFlow, and PyTorch.

NLP Libraries: Mention the natural language processing libraries required for analyzing
textual content. Popular choices include NLTK, spaCy, and Gensim.

Web Scraping Tools: Identify the tools or libraries needed to collect website data for the
dataset. Tools like Beautiful Soup or Scrapy are often used.

Database Management System: Specify the database management system for storing
and managing datasets. Options include MySQL, PostgreSQL, MongoDB, or SQLite.

User Interface Development Tools: Detail the tools, frameworks, or libraries used to
create a user-friendly front-end for users to interact with the system. Common choices
include Flask, Django, or JavaScript frameworks like React or Angular.

Security Tools: Consider including security tools or libraries for securing the system
against potential attacks and ensuring data privacy.

Web Hosting: If the system includes a web-based component, specify the web hosting
service or server requirements.

3.3.2 Hardware Requirements:

Computational Resources: Ensure that you have sufficient computational resources for
training machine learning models. This may involve powerful CPUs and GPUs.

Storage Capacity: Allocate enough storage capacity for maintaining datasets, model
checkpoints, and other relevant data. SSDs are often preferred for faster data access.

Internet Connectivity: A reliable internet connection is necessary, especially if your
system will conduct real-time website analysis.
CHAPTER 4

SYSTEM DESIGN

4.1 SYSTEM ARCHITECTURE


The system architecture for our phishing URL detection project leverages transfer
learning and encompasses data loading and preprocessing modules, feature extraction,
and custom classification. This architecture aims to provide a precise and scalable solution
for phishing URL detection while accommodating future enhancements for improved
performance and accessibility.

4.2 ACTIVITY DIAGRAM

This activity diagram illustrates the core workflow of phishing website detection using
Natural language processing.
Fig 4.3 Activity diagram

4.3 SEQUENCE DIAGRAM

In the sequence diagram for our phishing URL detection, we can visualize the interaction
between actors.

Fig 4.4 Sequence diagram

CHAPTER 5

SYSTEM IMPLEMENTATION

5.1 MODULES

Data Collection Module:

Responsible for gathering a dataset of URLs that includes both legitimate and phishing
sites.

Data Preprocessing Module:

Cleans and preprocesses the URL data, removing duplicates, special characters, and
normalizing the data.
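
The cleaning steps named above (removing duplicates, removing stray characters, and
normalizing) could look roughly like the following Python sketch. It is a sketch
under stated assumptions, not the project's actual code: the functions normalize_url
and clean_dataset, the default http scheme, and the choice to drop fragments and
trailing slashes are all hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(raw):
    """Return a normalized URL string, or None if the input is unusable."""
    # Remove whitespace and non-printable characters anywhere in the string.
    url = "".join(ch for ch in raw if ch.isprintable() and not ch.isspace())
    # Remove quote and angle-bracket wrappers often left over from scraping.
    url = url.strip("\"'<>")
    if not url:
        return None
    if "://" not in url:
        url = "http://" + url  # assumption: default to http when no scheme is given
    try:
        parts = urlsplit(url)
    except ValueError:
        return None  # malformed URL, for example a broken IPv6 literal
    if not parts.netloc:
        return None
    # Lowercase the host part, drop the fragment, and trim trailing slashes.
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, parts.query, ""))


def clean_dataset(rows):
    """Normalize (url, label) pairs and keep the first copy of each URL."""
    seen = set()
    cleaned = []
    for raw_url, label in rows:
        url = normalize_url(raw_url)
        if url is None or url in seen:
            continue
        seen.add(url)
        cleaned.append((url, label))
    return cleaned


if __name__ == "__main__":
    sample = [
        (" HTTP://Example.com/Login/ ", 0),
        ("http://example.com/Login", 0),  # duplicate after normalization
        ("", 1),                          # empty row is dropped
        ("evil.test/a#frag", 1),
    ]
    print(clean_dataset(sample))
```

Normalizing before de-duplication matters: two rows that differ only in letter case
or a trailing slash would otherwise both survive and bias the training set.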

Machine Learning Model Module:

Implements the machine learning algorithms for URL classification, such as Logistic
Regression, Naive Bayes, or Random Forest.
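
To make the classification step concrete, the following is a small, self-contained
Naive Bayes classifier over URL tokens, written in plain Python. It is a teaching
sketch only: the class NaiveBayesUrlClassifier, the tokenizer, and the six training
URLs are invented for this example, and a real implementation would more likely use
a library such as scikit-learn with a much larger dataset.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(url):
    """Split a URL into lowercase alphanumeric tokens."""
    return [tok for tok in re.split(r"[^a-z0-9]+", url.lower()) if tok]


class NaiveBayesUrlClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, urls, labels):
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        for url, label in zip(urls, labels):
            self.token_counts[label].update(tokenize(url))
        self.vocab = {tok for counts in self.token_counts.values() for tok in counts}
        return self

    def _log_score(self, tokens, label):
        # Log prior of the class plus the log likelihood of each known token.
        total_docs = sum(self.class_counts.values())
        score = math.log(self.class_counts[label] / total_docs)
        counts = self.token_counts[label]
        denom = sum(counts.values()) + len(self.vocab)
        for tok in tokens:
            if tok in self.vocab:  # tokens never seen in training are ignored
                score += math.log((counts[tok] + 1) / denom)
        return score

    def predict(self, url):
        tokens = tokenize(url)
        return max(self.class_counts, key=lambda lbl: self._log_score(tokens, lbl))


# Toy training data for illustration only (1 = phishing, 0 = legitimate).
TRAIN_URLS = [
    "http://secure-login.paypa1-verify.example/account/update",
    "http://192.0.2.10/bank/login/verify",
    "http://free-gift.example/verify/account",
    "https://www.python.org/downloads",
    "https://en.wikipedia.org/wiki/Phishing",
    "https://docs.example.org/guide/install",
]
TRAIN_LABELS = [1, 1, 1, 0, 0, 0]

model = NaiveBayesUrlClassifier().fit(TRAIN_URLS, TRAIN_LABELS)

if __name__ == "__main__":
    print(model.predict("http://example.test/login/verify/account"))
    print(model.predict("https://www.python.org/downloads/guide"))
```

With so few training URLs the model leans heavily on tokens such as "http" and
"https", so its predictions only demonstrate the mechanics and say nothing about
real-world accuracy.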

User Interface Module:

Develops the user interface for users to interact with the system, allowing them to input
URLs for analysis and displaying results.

Continuous Update Module:

Handles regular updates and retraining of the machine learning model to keep the system
effective against evolving phishing techniques.

Security and Ethical Module:

Implements security measures to protect the system and user data and ensures ethical
considerations and privacy measures are in place.

CHAPTER 6
CONCLUSION AND FUTURE ENHANCEMENT

6.1 CONCLUSION

In conclusion, the development of a phishing website analyzer using Natural Language
Processing (NLP) and machine learning represents a significant step toward enhancing
online security and protecting users from phishing threats. This project addresses the
limitations of the existing systems, offering a novel approach to accurately detect and
block phishing websites. The proposed system's advantages include enhanced accuracy,
real-time analysis, advanced security measures, a user-friendly interface, continuous
updates, and a strong emphasis on ethical considerations.

The system's modules, from data collection and preprocessing to machine learning, user
interface, continuous updates, and security and ethics, form a cohesive framework that
ensures efficient phishing detection and user protection.

Future Enhancements:

The phishing website analyzer project can be further improved and expanded in several
ways:

Advanced Machine Learning Models: Explore more advanced machine learning models
and deep learning techniques for even higher accuracy in phishing detection.

Behavioral Analysis: Implement behavioral analysis of websites in addition to NLP-based
analysis to enhance detection capabilities.

Real-time Alerts: Develop a real-time alerting system that can notify users when they
visit a potentially phishing website, enhancing proactive protection.

User Feedback Mechanism: Incorporate a user feedback mechanism to allow users to
report suspicious websites, thereby enhancing the system's learning and adaptability.

Integration with Browsers: Create browser extensions or plugins that integrate directly
with popular web browsers to provide seamless protection.

Multi-language Support: Extend the system's language support to detect phishing
websites in multiple languages.

Mobile Application: Develop a mobile application for on-the-go URL analysis and
protection.

Collaboration with ISPs: Collaborate with internet service providers (ISPs) to implement
phishing website detection at the network level, preventing users from accessing harmful
websites.

AI-Driven Analysis: Incorporate artificial intelligence (AI) components for improved
decision-making and threat identification.

Blockchain-based Data Security: Implement blockchain technology to secure and
protect the dataset, ensuring data integrity and privacy.

Open-source Initiative: Consider making the project open-source to encourage
collaboration and contributions from the cybersecurity community.

User Education Campaigns: Continue to focus on user education, with awareness
campaigns and resources to empower users to identify phishing threats.
