0% found this document useful (0 votes)
55 views11 pages

Sat - 26.Pdf - Phishing Website Detection Using Novel Machine Learning Fusion Approach

Here are a few potential research questions addressed in the literature review: - How effective are different machine learning algorithms (such as decision trees, neural networks, etc.) at detecting phishing URLs compared to previous rule-based detection methods? - What types of features (e.g. lexical, syntactic, website content-based) are most useful for machine learning models to accurately identify phishing websites? - How do models perform at detecting new or "zero-day" phishing attacks that use emerging techniques to evade detection? - What are some challenges and open problems in using machine learning for phishing detection, such as the need for large labeled training datasets or the arms race with adapting phishing techniques?

Uploaded by

Vj Kumar
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
55 views11 pages

Sat - 26.Pdf - Phishing Website Detection Using Novel Machine Learning Fusion Approach

Here are a few potential research questions addressed in the literature review: - How effective are different machine learning algorithms (such as decision trees, neural networks, etc.) at detecting phishing URLs compared to previous rule-based detection methods? - What types of features (e.g. lexical, syntactic, website content-based) are most useful for machine learning models to accurately identify phishing websites? - How do models perform at detecting new or "zero-day" phishing attacks that use emerging techniques to evade detection? - What are some challenges and open problems in using machine learning for phishing detection, such as the need for large labeled training datasets or the arms race with adapting phishing techniques?

Uploaded by

Vj Kumar
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 11

TABLE OF CONTENTS

ChapterNo. TITLE Page No.

ABSTRACT 8
LIST OF FIGURES vii
1 INTRODUCTION 9

2 LITERATURE SURVEY 10
2.1. Detection of Phishing URL using Machine Learning 10

2.2. A Survey of Machine Learning-Based Solutions for Phishing


23
Website Detection

2.3. Detection of Phishing Websites using MachineLearning 44

2.4 Detecting Phishing Websites Using Machine Learning 52

3 METHODOLOGY 61
v
3.1. EXISTING SYSTEM 61

3.2. PROPOSED SYSTEM 62

3.3. SYSTEM ARCHITECTURE 62

3.4. WORKING OFDECISION TREE 63

4 RESULTS AND DISCUSSION

76
4.1. MACHINE LEARNING

5 CONCLUSION

87
5.1.CONCLUSION

6 CONCLUSION 90

6.1. CONCLUSION

90
REFERENCES

APPENDICES

A. SOURCE CODE 92
B. SCREENSHOTS 95
C. PLAGIARISM REPORT 97
D. JOURNAL PAPER 99

LIST OF FIGURES
vi
Figure No. Figure Name Page No.

3.1 Block Diagram 62

3.2 Flow Diagram 63

3.3 Real Life Analogy 70

3.4 Bagging Parallel & Boosting Sequential 71

3.5 Bagging 72

3.6 Bagging Ensemble Method 73

4.1 Model 80

4.2 Clustering 83

4.3 ML Overview 84

4.4 Training and Testing 85

4.5 Validation Test 85

vii
ABSTARCT:

Phishing websites have proven to be a major security concern. Several cyber attacks risk the
confidentiality, integrity, and availability of company and consumer data, and phishing is the
beginning point for many of them.Many researchers have spent decades creating unique
approaches to automatically detect phishing websites. While cutting-edge solutions can deliver
better results, they need a lot of manual feature engineering and aren't good at identifying new
phishing attacks. As a result, finding strategies that can automatically detect phishing websites and
quickly manage zero-day phishing attempts is an open challenge in this field. The web page in the
URL which hosts that contains a wealth of data that can be used to determine the web server's
maliciousness.Machine Learning is an effective method for detecting phishing.It also eliminates the
disadvantages of the previous method.We conducted a thorough review of the literature and
suggested a new method for detecting phishing websites using features extraction and a machine
learning algorithm. The goal of this research is to use the dataset collected to train ML models and
deep neural nets to anticipate phishing websites.

8
Chapter 1

INTRODUCTION:

Phishing is the most unsafe criminal exercises in cyber space. Since most of the users go online to
access the services provided by government and financial institutions, there has been a significant
increase in phishing attacks for the past few years. Phishers started to earn money and they are
doing this as a successful business. Various methods are used by phishers to attack the
vulnerable users such as messaging, VOIP, spoofed link and counterfeit websites. It is very easy
to create counterfeit websites, which looks like a genuine website in terms of layout and content.
Even, the content of these websites would be identical to their legitimate websites. The reason for
creating these websites is to get private data from users like account numbers, login id, passwords
of debit and credit card, etc. Moreover, attackers ask security questions to answer to posing as a
high level security measure providing to users. When users respond to those questions, they get
easily trapped into phishing attacks. Many researches have been going on to prevent phishing
attacks by different communities around the world. Phishing attacks can be prevented by detecting
the websites and creating awareness to users to identify the phishing websites. Machine learning
algorithms have been one of the powerful techniques in detecting phishing websites. In this study,
various methods of detecting phishing websites have been discussed.

9
Chapter 2

Literature review

2.1 Detection of Phishing URL using Machine Learning

Abstract:

Phishing websites have proven to be a major security concern. Several cyberattacks


risk the confidentiality, integrity, and availability of company and consumer data, and phishing is
the beginning point for many of them. Many researchers have spent decades creating unique
approaches to automatically detect phishing websites. While cutting-edge solutions can deliver
better results, they need a lot of manual feature engineering and aren't good at identifying new
phishing attacks. As a result, finding strategies that can automatically detect phishing websites and
quickly manage zero-day phishing attempts is an open challenge in this field. The web page in the
URL which hosts that contains a wealth of data that can be used to determine the web server's
maliciousness. Machine Learning is an effective method for detecting phishing.It also eliminates
the disadvantages of the previous method. We conducted a thorough review of the literature and
suggested a new method for detecting phishing websites using features extraction and a machine
learning algorithm. The goal of this research is to use the dataset collected to train ML models and
deep neural nets to anticipate phishing websites.

INTRODUCTION

Phishing has become the most serious problem, harming individuals, corporations, and even entire
countries. The availability of multiple services such as online banking, entertainment, education,
software downloading, and social networking has accelerated the Web's evolution in recent years.
As a result, a massive amount of data is constantly downloaded and transferred to the Internet.
Spoofed e-mails pretending to be from reputable businesses and agencies are used in social
engineering techniques to direct consumers to fake websites that deceive users into giving
financial information such as usernames and passwords. Technical tricks involve the installation of
malicious software on computers to steal credentials directly, with systems frequently used to
intercept users' online account usernames and passwords.

A. Types of Phishing Attacks

10
• Deceptive Phishing:
This is the most frequent type of phishing assault, in which a
Cyber criminal impersonates a well-known institution, domain, or organization to acquire sensitive
personal information from the victim, such as login credentials, passwords, bank account
information, credit card information, and so on. Because there is no personalization or
customization for the people, this form of attack lacks sophistication.
•Spear Phishing: Emails containing malicious URLs in this sort of phishing email contain a lot of
personalization information about the potential victim. The recipient's name, company name,
designation, friends, co-workers, and other social information may be included in the email.
•Whale Phishing: To spear phish a "whale," here a top-level executive such as CEO, this sort of
phishing targets corporate leaders such as CEOs and top-level management employees.
• URL Phishing: To infect the target, the fraudster or cyber-criminal employs a URL link. People
are sociable creatures who will eagerly click the link to accept friend invitations and may even be
willing to disclose personal information such as email addresses.
This is because the phishers are redirecting users to a false web server. Secure browser
connections are also used by attackers to carry out their unlawful actions. Due to a lack of
appropriate tools for combating phishing attacks, firms are unable to train their staff in this area,
resulting in an increase in phishing attacks.Companies are educating their staff with mock phishing
assaults, updating all their systems with the latest security procedures, and encrypting important
Information as broad countermeasures. Browsing without caution is one of the most common ways
to become a victim of this phishing assault. The appearance of phishing websites is like that of
authentic websites.

Research question:
Are some of the research questions on which this research paper will elaborate.
• Is it possible to extract features from the URL using machine learning techniques?
• How can phishing URLs be detected using a Machine learning approach in terms of
efficiency?
The ultimate purpose of this study work is to provide a better understanding of the process of
identifying the presence of Phishing attacks using a machine learning technique to identify URL
based features like Address Bar, Domain, JavaScript, and HTML based features. The remaining
part of the paper is written out as follows. The Section 2 of paper is dedicated to a literature review.

11
Section 3 outlines the planned research approach, Section 4 presents the experimental data, and
Section 5 provides the conclusion.

Literature Review

Many scholars have done some sort of analysis on the statistics of phishing URLs. Our technique
incorporates key concepts from past research. We review past work in the detection of phishing
sites using URL features, which inspired our current approach. Happy describe phishing as "one of
the most dangerous ways for hackers to obtain users' accounts such as usernames, account
numbers and passwords, without their awareness." Users are ignorant of this type of trap and will
ultimately, they fall into Phishing scam. This could be due to a lack of a combination of financial aid
and personal experience, as well as a lack of market awareness or brand trust. In this article,
Mehmet et al. suggested a method for phishing detection based on URLs. To compare the results,
the researchers utilized eight different algorithms to evaluate the URLs of three separate datasets
using various sorts of machine learning methods and hierarchical architectures. The first method
evaluates various features of the URL; the second method investigates the website's authenticity
by determining where it is hosted and who operates it; and the third method investigates the
website's graphic presence.We employ Machine Learning techniques and algorithms to analyse
these many properties of URLs and websites. Garera et al. classify phishing URLs using logistic
regression over hand-selected variables. The inclusion of red flag keywords in the URL, as well as
features based on Google's Web page and Google's Page Rank quality recommendations, are
among the features. Without access to the same URLs and features as our approach, it's difficult
to conduct a direct comparison. In this research, Yong et al. created a novel approach for detecting
phishing websites that focuses on detecting a URL which has been demonstrated to be an
accurate and efficient way of detection. To offer you a better idea, our new capsule-based neural
network is divided into several parallel components. One method involves removing shallow
characteristics from URLs. The other two, on the other hand, construct accurate feature
representations of URLs and use shallow features to evaluate URL legitimacy. The final output of
our system is calculated by adding the outputs of all divisions. Extensive testing on a dataset
collected from the Internet indicate that our system can compete with other cutting-edge detection
methods while consuming a fair amount of time. For phishing detection, Vahid Shahrivari et al.
used machine learning approaches. They used the logistic regression classification method, KNN,
12
Adaboost algorithm, SVM, ANN and random forest. They found random forest algorithm provided
good accuracy. Dr.G. Ravi Kumar used a variety of machine learning methods to detect phishing
assaults. For improved results, they used NLP tools. They were able to achieve high accuracy
using a Support Vector Machine and data that had been pre-processed using NLP approaches.
Amani Alswailem et al. tried different machine learning model for phishing detection but was able
to achieve more accuracy in random forest. Hossein et al. created the ―Fresh-Phish‖ open-source
framework. This system can be used to build machine-learning data for phishing websites. They
used a smaller feature set and built the query in Python. They create a big, labelled dataset and
test several machine-learning classifiers on it. Using machine-learning classifiers, this analysis
yields very high accuracy. These studies look at how long it takes to train a model. X. Zhang
suggested a phishing detection model based on mining the semantic characteristics of word
embedding, semantic feature, and multi-scale statistical features in Chinese web pages to detect
phishing performance successfully. To obtain statistical aspects of web pages, eleven features
were retrieved and divided into five classes. To obtain statistical aspects of web pages, eleven
features were retrieved and divided into five classes. To learn and evaluate the model, AdaBoost,
Bagging, Random Forest, and SMO are utilized. The legitimate URLs dataset came from
DirectIndustry online guides, and the phishing data came from China's Anti-Phishing Alliance. With
novel methodologies, M. Aydin approaches a framework for extracting characteristics that is
versatile and straightforward. Phish Tank provides data, and Google provides authentic URLs. C#
programming and R programming were utilized to btain the text attributes. The dataset and third-
party service providers yielded a total of 133 features. The feature selection approaches of CFS
subset based and Consistency subset-based feature selection were employed and examined with
the WEKA tool. The performance of the Nave Bayes and Sequential Minimal Optimization (SMO)
algorithms was evaluated, and the author prefers SMO to NB for phishing detection.

Research Methodology

A phishing website is a social engineering technique that imitates legitimate webpages and uniform
resource locators (URLs). The Uniform Resource Locator (URL) is the most common way for
phishing assaults to occur. Phisher has complete control over the URL's sub-domains. The phisher
can alter the URL because it contains file components and directories.
Methodologies
13
This research used the linear-sequential model, often known as the waterfall model. Although the
waterfall approach is considered conventional, it works best in instances where there are few
requirements. The application was divided into smaller components that were built using
frameworks and hand-written code.

Research Framework:

The steps of this research in which some selected publications were read to determine the
research gap and, as a result, the research challenge was defined. Feature selection, classification
and phishing website detection were all given significant consideration. It's worth noting that most
phishing detection researchers rely on datasets they've created. However, because the datasets
utilized were not available online for those who use and check their results, it is difficult to assess
and compare the performance of a model with other models. As a result, such results cannot be
generalized.

Language

For the preparation of this dissertation, I used Python as the primary language. Python is a
language that is heavily focused on machine learning. It includes several machine learning libraries
that may be utilized straight from an import. Python is commonly used by developers all around the
world to deal with machine learning because of its extensive library of machine learning libraries.
Python has a strong community, and as a result, new features are added with each release.

Data Collection

The phishing URLs were gathered using the open source tool Phish Tank. This site provides a set
of phishing URLs in a variety of forms, including csv, json, and others, which are updated hourly.
This dataset is used to train machine learning models with 5000 random phishing URLs.

Data Cleaning

Fill in missing numbers, smooth out creaking data, detect and delete outliers, and repair anomalies
to clean up the data.
14
Data Pre-processing

Data pre-processing is a cleaning operation that converts unstructured raw data into a neat, well-
structured dataset that may be used for further research. Data pre-processing is a cleaning
operation that transforms unstructured raw data into well-structured and neat dataset which can be
used for further research.

Extraction of Features

In the literature and commercial products, there are numerous algorithms and data formats for
phishing URL detection. A phishing URL and its accompanying website have various
characteristics that distinguish them from harmful URLs. For example, to mask the true domain
name, an attacker can create a long and complicated domain name. Different types of features
that are used in machine learning algorithms in the academic study detection process are used.
The following is a list of features gathered from academic studies for phishing domain detection
using machine learning approaches. Because of some constraints, it may not be logical to use
some of the features in specific instances. Using Content-Based Features to construct a quick
detection mechanism capable of analyzing a huge number of domains may not be feasible. Page-
Based Features are not very effective when analyzing registered domains. As a result, the features
that the detection mechanism will use are determined by the detection mechanism's purpose. So,
which features should be used in the detecting technique been carefully chosen.

Models and Training

The data is split into 8000 training samples and 2000 testing samples, before the ML model is
trained. It is evident from the dataset that this is a supervised machine learning problem.
Classification and regression are the two main types of supervised machine learning issues.
Because the input URL is classed as legitimate or phishing, this data set has a classification
problem. The following supervised machine learning models were examined for this project's
dataset training:

• Decision Tree
• Multilayer Perceptron

15

You might also like