
ABSTRACT

Phishing refers to a deceptive strategy aimed at acquiring sensitive information, including login details, credit card data, and personal information, by pretending to be a trustworthy source
in digital communications. Phishing websites mimic legitimate sites to deceive individuals into
disclosing their private data. These fraudulent activities are on the rise and growing in complexity,
making it challenging for individuals to identify them. Consequently, it is vital to establish
effective safeguards against phishing attempts.

Phishing represents a prevalent form of cyberattack in which malicious actors employ counterfeit websites and deceptive emails to coerce individuals into disclosing sensitive
information. As the frequency of phishing attacks continues to rise, the identification and
prevention of such threats pose significant challenges. One strategy to tackle this issue involves
the creation of automated systems designed to swiftly identify phishing websites. This summary
introduces a concept for detecting phishing websites and outlines the fundamental components
essential for constructing such systems. The proposed approach encompasses phases like data
gathering, feature curation, categorization, and assessment. The summary elaborates on a range of
methodologies applied within each of these stages, which encompass the use of machine learning
algorithms, website analysis, and feature extraction. These techniques collectively enable the
development of potent and efficient systems for detecting phishing websites, thereby safeguarding
users from falling victim to these fraudulent activities.

INDEX

Abstract
Index
Chapter 1: Introduction
1.1 Introduction
1.2 Literature Survey

Chapter 2: Methodology
2.1 Working
2.2 Block Diagram
2.2.1 Description of Block Diagram

Chapter 3: Process and Requirements
3.1 Software Requirements
3.1.1 VS Code
3.1.2 Python Environment IDLE
3.1.3 Streamlit

Chapter 4: Result and Conclusion
4.1 Result
4.2 Conclusion
4.3 Future Scope

Chapter 5: References

LIST OF FIGURES

2.2.1 Block Diagram of Phishing Website Tracker
4.1.1 Safe Website
4.1.2 Phishing Website
CHAPTER 1: INTRODUCTION
1.1 Introduction:
In recent times, the internet has witnessed remarkable growth, driven by the proliferation of a
multitude of online services, including online banking, entertainment, education, software
distribution, and social networking. As a result, there has been a substantial increase in the
continuous exchange of data on the World Wide Web. This surge in online activity has
unfortunately created opportunities for malicious individuals to exploit and compromise critical
personal and financial information, such as usernames, passwords, account details, and national
identification numbers. These malicious activities are commonly referred to as Web phishing
attacks, and they stand out as a significant challenge within the realm of web security.

Given the inevitability of phishing websites targeting a wide range of online entities, including
businesses, financial institutions, internet users, and governmental organizations, it has become
imperative to proactively address the issue of Web phishing attacks at their inception.
Nevertheless, the identification of phishing websites remains a formidable task, owing to the
numerous inventive techniques employed by phishing attackers to mislead internet users.

Phishing stands as one of the most perilous criminal activities in the realm of cyberspace.
With an increasing number of users turning to the internet to access essential government and
financial services, there has been a notable surge in phishing attacks over recent years. What's
alarming is that phishing has evolved into a lucrative business for cybercriminals. Phishers have
honed their techniques, using methods like messaging, VOIP (Voice over Internet Protocol),
spoofed links, and the creation of counterfeit websites to target vulnerable users.

The creation of counterfeit websites, which closely mimic the appearance and content of
genuine websites, has become a common practice. These imposter websites are designed to extract
sensitive data from unsuspecting users, including account numbers, login credentials, debit and
credit card information, and more. In some cases, attackers pose as high-level security entities,
asking users security questions under the guise of enhancing security. When users divulge this
information, they inadvertently fall victim to phishing attacks.

Machine learning algorithms have
emerged as a powerful tool in the fight against phishing. This study delves into various methods
used for detecting and preventing phishing websites, shedding light on the ongoing efforts to
thwart these malicious activities.

1.2 LITERATURE SURVEY
We have studied the literature, which gives the following information:
Waleed Ali, Department of Information Technology, Faculty of Computing and Information Technology, King Abdulaziz University, Rabigh, Kingdom of Saudi Arabia. This is from a paper in the (IJACSA) International Journal of Advanced Computer Science and Applications.

In the ongoing effort to combat the escalating threat of phishing, a multitude of innovative
approaches have been put forward by researchers. These methods are designed to bolster the
detection and prevention of phishing attacks. For instance, one approach involves the application
of a support vector machine, utilizing 12 distinct features, to identify phishing pages. Barraclough
has harnessed a Neuro-Fuzzy scheme, incorporating five critical inputs, such as legitimate site
rules, user behavior profiles, PhishTank data, user-specific websites, and email pop-up analysis,
to achieve real-time and highly accurate phishing website detection. Mohammad's contributions
include rule-based data mining classification techniques with 17 different features to effectively
distinguish phishing from legitimate websites, along with the proposal of an intelligent model
based on self-structuring neural networks for predicting and mitigating phishing attacks.
Abdelhamid, on the other hand, introduced the Multi-Label Classifier based Associative
Classification (MCAC) approach to identify phishing websites. Furthermore, various classification
techniques, including neural networks (NN), support vector machines (SVM), naïve Bayes (NB),
decision trees, random forests, and others, have been actively deployed in the ongoing battle
against phishing. These diverse strategies and tools collectively contribute to the comprehensive
endeavor to address this pervasive and evolving online threat.

"Phishing website detection based on effective machine learning approach" by Dr. Gururaj
Harinahalli Lokesh & Goutham BoreGowda, Wireless Inter Networking Research Group
(Wing), Vidyavardhaka College of Engineering, Mysuru, India.

Researchers have extensively investigated the applicability of machine learning (ML) techniques for the detection of phishing attacks, providing insights into their strengths and
weaknesses. They have explored various ML algorithms to determine the most suitable choices
for effective anti-phishing tools. One notable contribution includes the development of a Phishing
Classification system, which extracts features designed to outwit conventional phishing detection
methods. This system employs numeric representation and conducts a comparative analysis of
classical ML techniques such as Random Forest, K-nearest neighbors, Decision Tree, Linear SVC
classifier, One-class SVM classifier, and wrapper-based feature selection, utilizing URL metadata
to assess the legitimacy of websites.

In a separate study, authors presented a novel approach to detect phishing websites using
five different ML algorithms, namely Decision Tree (DT), Random Forest (RF), Gradient Boosting
(GBM), Generalized Linear Model (GLM), and Generalized Additive Model (GAM). They
assessed the accuracy, precision, and recall of each algorithm, extracting 30 website attributes with
Python and evaluating performance using the R programming language. Notably, the top-performing algorithms were Decision Tree, Random Forest, and GBM.

In another innovative approach, researchers utilized Fuzzy Logic (FL) to detect and
identify phishing websites, emphasizing the efficacy of FL in assessing and identifying such
websites compared to traditional methods. Their approach addressed the inherent "fuzziness" in
traditional website phishing risk assessment, offering an intelligent and robust model for detecting
phishing websites. This model leveraged FL operators to characterize phishing factors and
indicators as fuzzy variables, resulting in six measures and criteria for assessing different
dimensions of phishing website attacks, revealing the significance of URL and domain identity in
the assessment process.

These studies collectively contribute to the advancement of anti-phishing measures by harnessing the capabilities of machine learning and Fuzzy Logic in the detection and mitigation of
phishing threats.

CHAPTER 2: METHODOLOGY
2.1 WORKING

Our project is a phishing website tracker that detects whether a website is safe or fraudulent. We use a machine learning model to make this prediction. The project contains multiple scripts, each with its own purpose: one for data collection, one for feature extraction, another for designing the web app, and so on.

The features.py code is part of a web scraping project that analyzes HTML pages. The code uses the BeautifulSoup library to parse an HTML file and extract information from it. The functions defined in the code check whether the HTML file contains certain elements or attributes, such as a title, input fields, buttons, images, links, passwords, email inputs, hidden elements, audio and video elements, and more.

The code extracts several features from the HTML source code of a website to determine whether
it is a phishing website or not. Here is a brief explanation of how the code extracts some of the
features:

URL length: The code extracts the length of the URL of the website, which is an important feature
because many phishing websites have unusually long URLs.

Number of dots in URL: The code counts the number of dots in the URL of the website, which is
also an important feature because many phishing websites have URLs that contain multiple dots.

Presence of @ symbol in URL: The code checks whether the URL of the website contains the '@'
symbol, which is rarely used in legitimate URLs but is commonly used in phishing URLs.

Presence of a redirecting link: The code checks whether the website has a redirecting link, which
is a common technique used by phishing websites to redirect users to another website where they
can steal their personal information.

Use of iFrames: The code checks whether the website uses iFrames, which are commonly used in
phishing websites to display fake login forms or other fake content.

Presence of a favicon: The code checks whether the website has a favicon, which is a small icon
that appears in the browser's address bar or tab. Phishing websites may not have a favicon or may
use a fake one.

By extracting these features from the HTML source code of a website, the code can identify whether the website is likely to be a phishing website or not.
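As a rough illustration of the URL-based checks described above, the following sketch computes a few of these features in plain Python. The function name, the feature names, and the redirect heuristic are illustrative only, not the project's actual features.py code:

```python
from urllib.parse import urlparse

def url_features(url: str) -> dict:
    """Compute simple lexical features of a URL (illustrative subset)."""
    parsed = urlparse(url)
    return {
        # Phishing URLs are often unusually long.
        "url_length": len(url),
        # Many dots suggest deceptive multi-level subdomains.
        "num_dots": url.count("."),
        # '@' is rare in legitimate URLs but common in phishing ones.
        "has_at_symbol": "@" in url,
        # An embedded '//' after the scheme can signal a redirect trick.
        "has_redirect": "//" in url[len(parsed.scheme) + 3:],
    }

# One of the example phishing URLs listed later in this chapter:
features = url_features("http://www.eki-net.con-aescceccesaas.qfuhtb.top/jp.php")
```

Checks like these are cheap because they need only the URL string, not the page content, which is why they are usually computed before any HTML parsing.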

Next, coming to the feature_extraction.py code: it imports the necessary modules, BeautifulSoup for parsing HTML, os for accessing files in the file system, and the custom module features.py, which contains the functions used to extract features from HTML content. Its main function takes a BeautifulSoup object as input and extracts various features from the HTML content using the functions defined in features.py. The features extracted include the presence of certain HTML elements (e.g. "title", "input", "button"), the number of occurrences of certain elements, the length of the "title" element, and so on. The function returns a list of these features.

Overall, this code defines a series of functions to extract features from HTML files and creates a
2D array of these features for a set of HTML files in a directory. The extracted features include
information such as whether the HTML file has a title, input, button, image, link, password, audio,
video, footer, form, text area, iframe, text input, navigation, picture, and table, as well as the
number of inputs, buttons, images, paragraphs, scripts, and links, the length of the title and text,
and the number of various HTML elements such as TH, TR, H1, H2, H3, A, IMG, DIV, FIGURE,
META, SPAN, and sources. The code also includes commented-out code to create a Pandas
dataframe from the 2D array.
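To illustrate the kind of feature vector this code builds, here is a minimal sketch using Python's built-in html.parser instead of the project's BeautifulSoup-based code. The six features shown are a small subset of those listed above, and the real code's vector layout may differ:

```python
from html.parser import HTMLParser
from collections import Counter

class TagCounter(HTMLParser):
    """Count HTML tags and capture the <title> text while parsing."""
    def __init__(self):
        super().__init__()
        self.counts = Counter()
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

def create_vector(html: str) -> list:
    """Return a small feature vector in the spirit of feature_extraction.py."""
    parser = TagCounter()
    parser.feed(html)
    c = parser.counts
    return [
        int(c["title"] > 0),   # has a title?
        int(c["input"] > 0),   # has input fields?
        int(c["button"] > 0),  # has buttons?
        c["img"],              # number of images
        c["a"],                # number of links
        len(parser.title),     # length of the title text
    ]

page = "<html><head><title>Login</title></head><body><input><a href='#'>x</a></body></html>"
vector = create_vector(page)  # [1, 1, 0, 0, 1, 5]
```

The same pattern extends naturally to the full feature list (passwords, iframes, audio/video elements, and so on) by counting more tags and attributes.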

Coming to the data_collection.py code: in this part, a CSV file named "verified_online_2.csv" is read using Pandas, and the URLs from the 'url' column are stored in a list called "URL_list". A function called "create_structured_data" takes a list of URLs as input and returns a list of vectors (structured data) extracted from the HTML content of each URL. For each URL in the list, the function makes an HTTP request to fetch the HTML content. If the response code is 200 (i.e., the connection is successful), the function uses the BeautifulSoup library to extract specific features from the HTML content using the "create_vector" function from the "feature_extraction" module. The extracted features are appended to a list called "vector", which also includes the URL itself. This list is then appended to another list called "data_list", which the function returns. If the HTTP connection is not successful, the function prints a message and continues to the next URL.

The final output of this code is a CSV file named "structured_data_phishing_2.csv" which contains
structured data extracted from a list of URLs, along with a label column indicating whether the
website is a phishing website or not (in this case, all labels are set to 1, indicating phishing
websites).

The structured data includes various features such as the presence of certain HTML elements (e.g.,
title, input, button), the number of such elements, the length of the title, and so on. These features
can be used as input to a machine learning model to classify websites as phishing or legitimate.

The code reads a CSV file containing a list of URLs, selects a subset of URLs, scrapes the content
of each URL, extracts the features, and saves the data to the CSV file.
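The collection loop described above can be sketched as follows. To keep the example self-contained and offline, the HTTP fetch and the feature extractor are passed in as functions; the real data_collection.py calls the requests and BeautifulSoup libraries directly, and the helper names here are illustrative:

```python
import csv
import io

def create_structured_data(url_list, fetch_html, create_vector, label):
    """Mirror the data_collection.py loop: fetch, vectorise, and label each URL."""
    data_list = []
    for url in url_list:
        html = fetch_html(url)          # returns None when the request fails
        if html is None:
            print(f"Connection failed for {url}, skipping")
            continue
        vector = create_vector(html)    # features from feature_extraction
        vector.append(url)              # keep the URL with its features
        vector.append(label)            # 1 = phishing, 0 = legitimate
        data_list.append(vector)
    return data_list

def save_csv(rows, header, fileobj):
    """Write the structured data with a header row."""
    writer = csv.writer(fileobj)
    writer.writerow(header)
    writer.writerows(rows)

# Demo with a stubbed fetcher (no real HTTP) and a trivial one-feature vectoriser:
fetch = lambda url: "<html></html>" if "good" in url else None
vec = lambda html: [len(html)]
rows = create_structured_data(["http://good.example", "http://bad"], fetch, vec, 1)
buf = io.StringIO()
save_csv(rows, ["html_length", "url", "label"], buf)
```

Injecting the fetch step also makes the pipeline easy to test without network access, which is useful when re-running data collection against archived phishing pages.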

Example fake URLs:

http://www.eki-net.con-aescceccesaas.qfuhtb.top/jp.php
http://loteriada-urodzinowa.pl/millenium_payment.php
https://pl.lnpost-lupappokupka.tech/payment/49952963630/pl/
http://rftktofficial.com
https://track-uspspackage.gotdns.ch/update
http://metamlask.com

Example legitimate URLs:

https://vnrvjiet.ac.in/
https://automation.vnrvjiet.ac.in/eduprime3
https://codeforces.com/problemset
https://practice.geeksforgeeks.org/
https://tnp.vnrvjiet.ac.in/login
https://www.codechef.com/practice
https://smartinterviews.in/

2.2 BLOCK DIAGRAM
The block diagram of the Phishing Website Tracking system is shown below. This shows an overall view of the system. The blocks connected here represent the methods used in this project.

FIG 2.2.1: BLOCK DIAGRAM OF PHISHING WEBSITE TRACKER

2.2.1 DESCRIPTION OF BLOCK DIAGRAM


In this project, we are working to create a user interface that empowers users to assess the
safety and potential threats associated with websites, ensuring not only their personal safety but
also safeguarding systems, societal security, individual information, and confidential user
credentials. With this platform, users can effortlessly differentiate between secure and phishing
websites by pasting the URL into a designated field.
Our project relies on well-crafted code that extracts and selects essential features for
determining a website's safety. These parameters are vital for our pre-designed machine learning
algorithms, which facilitate the automated categorization of websites as either safe or fraudulent.
Our system conducts feature extraction by meticulously analyzing various aspects of a website,
including its content, structure, and behavior. Elements considered include security certificates,
server reputation, website age, SSL certificate validity, and other critical attributes. This
comprehensive feature set captures the most vital indicators of a website's authenticity.

As our project evolves, we will continually refine feature extraction techniques to remain
at the forefront of web security. By combining advanced data analysis with cutting-edge machine
learning algorithms, our goal is to provide a robust and reliable tool for website safety assessment.
Our vision extends beyond individual users; we aim to create a safer digital ecosystem where
systems, organizations, and individuals can trust their online experiences, knowing that our
platform is dedicated to preserving their security and privacy. We are excited to embark on this
journey toward a safer digital future, fostering trust and security in the evolving landscape of the
internet.

CHAPTER 3: PROCESS & REQUIREMENTS
3.1 SOFTWARE REQUIREMENTS
3.1.1 Visual Studio Code

Visual Studio Code (VS Code) is a highly regarded, lightweight source code editor
developed by Microsoft. It's designed to provide a powerful coding experience while remaining
quick and efficient, making it a popular choice for developers on Windows, macOS, and Linux.
What sets VS Code apart is its extensibility, with a vast ecosystem of extensions available through
the Visual Studio Code Marketplace. These extensions cater to a wide range of programming
languages and tools, allowing developers to customize the editor to their specific needs. With built-
in support for Git version control, intelligent code editing features, debugging capabilities,
integrated terminals, and task automation, VS Code offers a comprehensive development
environment. Its user-friendly interface, themes, and a supportive community make it an excellent
choice for coding and collaboration across various projects and languages. Whether you're a
beginner or an experienced developer, Visual Studio Code can enhance your coding productivity
and streamline your development workflow.

3.1.2 Python Environment IDLE

Python IDLE, or the "Integrated Development and Learning Environment," is an integrated development environment bundled with Python. It's a versatile tool for Python developers, with a
user-friendly interface that serves multiple purposes. The interactive shell is a standout feature,
offering immediate code execution, making it ideal for quick experimentation and learning. You
can also write and save Python scripts within IDLE, supporting more extensive projects. Its built-in debugger is a valuable asset for troubleshooting code, allowing users to set breakpoints and step
through their programs. The autocompletion feature helps write code more efficiently, and the
multi-window text editor facilitates working on multiple scripts simultaneously. With syntax
highlighting, easy access to Python documentation, and cross-platform compatibility, Python
IDLE is an excellent choice for both beginners and experienced Python developers. While more
specialized IDEs exist, IDLE's simplicity and accessibility make it a favored environment for many
Python programming tasks.

3.1.3 Streamlit

Streamlit is a Python library designed for effortless web app creation, particularly in data
science and machine learning. With Streamlit, you can turn data scripts into interactive web
applications using only a few lines of code. It's an excellent choice for rapid prototyping, enabling
quick iteration on data analysis and visualizations. The library supports a wide range of data
visualization tools and widgets for user interaction. Sharing apps is simple, and Streamlit's
ecosystem includes extensions and deployment options like Streamlit Sharing. This combination
of simplicity and flexibility makes Streamlit a popular choice for data scientists and developers
looking to showcase their work or deploy machine learning models as user-friendly web
applications.

CHAPTER 4: RESULTS & CONCLUSIONS
4.1 Result
We just need to enter the URL of the target website.
Here we have typed http://www.vnrvjiet.ac.in/, which is not a phishing site, so green bubbles appear, indicating it is safe.

Fig-4.1.1: Secure Website. On giving a website link, the system determined that the given URL is a legitimate one. For classifying URLs, any of the machine learning algorithms offered in the drop-down menu, such as KNN, random forest, or decision tree, can be used.

Next, we enter a phishing site, http://loteriada-urodzinowa.pl/millenium_payment.php.
Note the message and warning sign, as the entered site is phishing.

Fig-4.1.2: Potential phishing website. On giving a website link, the system determined that the given URL is a potential phishing website. For classifying URLs, any of the machine learning algorithms offered in the drop-down menu, such as KNN, random forest, or decision tree, can be used.
Website link: https://pranaydavath-phishing-website-tracker-app-785uin.streamlit.app/

4.2 Conclusion
Our project is the culmination of dedicated efforts to fortify the digital realm and safeguard
it from the perils of web phishing attacks. It stands as a testament to our commitment to secure the
online experiences of users, uphold the integrity of organizations, and ensure the collective safety
of society. With machine learning as our ally and awareness as our weapon, we are not merely
concluding a project; we are launching a transformative movement.

As we venture into this new era, we envision a digital landscape where uncertainty and
apprehension give way to confidence and trust. This isn't just the end of a project; it's the beginning
of a more secure and resilient online world. Our mission is to empower individuals, organizations,
and society with the knowledge and tools to discern authentic websites from deceptive ones. We
are taking the first steps towards a future where every online interaction is characterized by safety,
where digital vulnerabilities are replaced by robust defenses, and where the harmony of the online
ecosystem prevails.

Our project's last words mark a new chapter in the ever-evolving narrative of web security.
They signify a resolute commitment to building a digital world where users can navigate the
internet with confidence, where organizations can conduct business without trepidation, and where
the collective safety of society is upheld. In these concluding moments, we are setting the course
for a more secure, resilient, and trustworthy online future.

4.3 Future Scope


The future scope for phishing website detection using machine learning holds great promise as
advancements in both cyber threats and machine learning techniques continue to unfold.
Anticipated developments include enhanced accuracy in identifying phishing websites, real-time
detection capabilities, behavioural analysis to monitor user interactions with websites, increased
reliance on transfer learning, and a proactive approach to counter adversarial attacks. Multi-modal
data analysis, integrating user education, and encouraging collaboration and data sharing among
organizations for collective defense are expected to be key components. The integration of
phishing detection with other security tools, regulatory considerations, and adaptations for mobile
and IoT security will further shape the landscape of this evolving field, reinforcing the critical role
that machine learning plays in combating phishing threats.

CHAPTER 5: REFERENCES
[1] Gunter Ollmann, "The Phishing Guide: Understanding & Preventing Phishing Attacks", Internet Security Systems, 2007.
[2] Mahmoud Khonji, Youssef Iraqi, and Andrew Jones, "Phishing Detection: A Literature Survey", IEEE, 2013.
[3] Mohammad R., Thabtah F., McCluskey L. (2015), Phishing websites dataset. Available: https://archive.ics.uci.edu/ml/datasets/Phishing+Websites. Accessed January 2016.
[4] Purbay M., Kumar D., "Split Behavior of Supervised Machine Learning Algorithms for Phishing URL Detection", Lecture Notes in Electrical Engineering, vol. 683, 2021, doi: 10.1007/978-981-15-6840-4_40.
[5] Jain A. K., Gupta B. B., "PHISH-SAFE: URL Features-Based Phishing Detection System Using Machine Learning", Cyber Security, Advances in Intelligent Systems and Computing, vol. 729, 2018, doi: 10.1007/978-981-10-8536-9_44.
[6] Gandotra E., Gupta D., "An Efficient Approach for Phishing Detection using Machine Learning", Algorithms for Intelligent Systems, Springer, Singapore, 2021, doi: 10.1007/978-981-15-8711-5_12.
[7] R. S. Rao and S. T. Ali, "PhishShield: A Desktop Application to Detect Phishing Webpages through Heuristic Approach", Procedia Computer Science, vol. 54, pp. 147-156, 2015.
[8] Y. Zhang, J. I. Hong, and L. F. Cranor, "Cantina: A Content-based Approach to Detecting Phishing Web Sites", New York, NY, USA, 2007, pp. 639-648.
[9] Hassan Y. A. Abutair, Abdelfettah Belghith, "Using Case-Based Reasoning for Phishing Detection", The 8th International Conference on Ambient Systems, Networks and Technologies (ANT), 2017, pp. 281-288.
[10] L. Breiman, "Random Forests", Machine Learning, vol. 45, no. 1, pp. 5-32, Oct. 2001.
[11] Sirageldin, B. B. Baharudin, and L. T. Jung, "Malicious Web Page Detection: A Machine Learning Approach", in Advances in Computer Science and its Applications, Springer, Berlin, Heidelberg, 2014, pp. 217-22.

APPENDIX

First, the data collection part includes the following lines of code.

Going to the features part, we have the following features for classification.

Let us now look at the training of the model: the libraries required, the training datasets, and the code for predicting accuracy and precision.
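The training step classifies the structured feature vectors with algorithms such as KNN, random forest, or decision tree. As a self-contained illustration of that classification step, here is a minimal k-nearest-neighbours classifier and accuracy check in plain Python, applied to toy two-feature vectors; the project itself would use a library implementation on the full feature set:

```python
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), label)
        for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

def accuracy(train_X, train_y, test_X, test_y, k=3):
    """Fraction of test points classified correctly."""
    preds = [knn_predict(train_X, train_y, x, k) for x in test_X]
    return sum(p == t for p, t in zip(preds, test_y)) / len(test_y)

# Toy feature vectors: [url_length, num_dots]; 1 = phishing, 0 = legitimate.
train_X = [[75, 5], [80, 6], [90, 4], [20, 1], [25, 2], [30, 1]]
train_y = [1, 1, 1, 0, 0, 0]
test_X = [[85, 5], [22, 1]]
test_y = [1, 0]
acc = accuracy(train_X, train_y, test_X, test_y)  # 1.0 on this toy split
```

Swapping this for scikit-learn's KNeighborsClassifier, RandomForestClassifier, or DecisionTreeClassifier requires only replacing the predict call; the surrounding train/test evaluation loop stays the same.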
