Minfinal
INDEX
DESCRIPTION
Abstract
Index
Chapter 1: Introduction
1.1 Introduction
1.2 Literature Survey
Chapter 2: Methodology
2.1 Working
2.2 Block Diagram
2.2.1 Description of Block Diagram
Chapter 3: Process & Requirements
3.1 Software Requirements
Chapter 4: Results & Conclusions
4.1 Result
4.2 Conclusion
Chapter 5: References
Appendix
LIST OF FIGURES
DESCRIPTION
Fig-4.1.1: Secure Website
Fig-4.1.2: Potential Phishing Website
CHAPTER 1: INTRODUCTION
1.1 Introduction:
In recent times, the internet has witnessed remarkable growth, driven by the proliferation of a
multitude of online services, including online banking, entertainment, education, software
distribution, and social networking. As a result, there has been a substantial increase in the
continuous exchange of data on the World Wide Web. This surge in online activity has
unfortunately created opportunities for malicious individuals to exploit and compromise critical
personal and financial information, such as usernames, passwords, account details, and national
identification numbers. These malicious activities are commonly referred to as Web phishing
attacks, and they stand out as a significant challenge within the realm of web security.
Given the inevitability of phishing websites targeting a wide range of online entities, including
businesses, financial institutions, internet users, and governmental organizations, it has become
imperative to proactively address the issue of Web phishing attacks at their inception.
Nevertheless, the identification of phishing websites remains a formidable task, owing to the
numerous inventive techniques employed by phishing attackers to mislead internet users.
Phishing stands as one of the most perilous criminal activities in the realm of cyberspace.
With an increasing number of users turning to the internet to access essential government and
financial services, there has been a notable surge in phishing attacks over recent years. What's
alarming is that phishing has evolved into a lucrative business for cybercriminals. Phishers have
honed their techniques, using methods like messaging, VOIP (Voice over Internet Protocol),
spoofed links, and the creation of counterfeit websites to target vulnerable users.
The creation of counterfeit websites, which closely mimic the appearance and content of
genuine websites, has become a common practice. These imposter websites are designed to extract
sensitive data from unsuspecting users, including account numbers, login credentials, debit and
credit card information, and more. In some cases, attackers pose as high-level security entities,
asking users security questions under the guise of enhancing security. When users divulge this
information, they inadvertently fall victim to phishing attacks. Machine learning algorithms have
emerged as a powerful tool in the fight against phishing. This study delves into various methods
used for detecting and preventing phishing websites, shedding light on the ongoing efforts to
thwart these malicious activities.
1.2 LITERATURE SURVEY
We have studied the following literature surveys, which provide the information summarized below.
The first paper is by Waleed Ali, Department of Information Technology, Faculty of Computing and Information Technology, King Abdulaziz University, Rabigh, Kingdom of Saudi Arabia, published in the (IJACSA) International Journal of Advanced Computer Science and Applications.
In the ongoing effort to combat the escalating threat of phishing, a multitude of innovative
approaches have been put forward by researchers. These methods are designed to bolster the
detection and prevention of phishing attacks. For instance, one approach involves the application
of a support vector machine, utilizing 12 distinct features, to identify phishing pages. Barraclough
has harnessed a Neuro-Fuzzy scheme, incorporating five critical inputs, such as legitimate site
rules, user behavior profiles, PhishTank data, user-specific websites, and email pop-up analysis,
to achieve real-time and highly accurate phishing website detection. Mohammad's contributions
include rule-based data mining classification techniques with 17 different features to effectively
distinguish phishing from legitimate websites, along with the proposal of an intelligent model
based on self-structuring neural networks for predicting and mitigating phishing attacks.
Abdelhamid, on the other hand, introduced the Multi-Label Classifier based Associative
Classification (MCAC) approach to identify phishing websites. Furthermore, various classification
techniques, including neural networks (NN), support vector machines (SVM), naïve Bayes (NB),
decision trees, random forests, and others, have been actively deployed in the ongoing battle
against phishing. These diverse strategies and tools collectively contribute to the comprehensive
endeavor to address this pervasive and evolving online threat.
The second paper is "Phishing website detection based on effective machine learning approach" by Dr. Gururaj Harinahalli Lokesh & Goutham BoreGowda, Wireless Inter Networking Research Group (Wing), Vidyavardhaka College of Engineering, Mysuru, India.
In a separate study, authors presented a novel approach to detect phishing websites using
five different ML algorithms, namely Decision Tree (DT), Random Forest (RF), Gradient Boosting
(GBM), Generalized Linear Model (GLM), and Generalized Additive Model (GAM). They
assessed the accuracy, precision, and recall of each algorithm, extracting 30 website attributes with
Python and evaluating performance using the R programming language. Notably, the top-
performing algorithms were Decision Tree, Random Forest, and GBM.
In another innovative approach, researchers utilized Fuzzy Logic (FL) to detect and
identify phishing websites, emphasizing the efficacy of FL in assessing and identifying such
websites compared to traditional methods. Their approach addressed the inherent "fuzziness" in
traditional website phishing risk assessment, offering an intelligent and robust model for detecting
phishing websites. This model leveraged FL operators to characterize phishing factors and
indicators as fuzzy variables, resulting in six measures and criteria for assessing different
dimensions of phishing website attacks, revealing the significance of URL and domain identity in
the assessment process.
CHAPTER 2: METHODOLOGY
2.1 WORKING
Our project is a phishing website tracker that detects whether a website is safe or fraudulent. We use a machine learning model to make this prediction. The project contains multiple scripts, each with its own significance: one for data collection, one for feature extraction, another for designing the web app, and so on.
The features.py code is part of a web scraping project that analyzes HTML pages. It uses the BeautifulSoup library to parse an HTML file and extract information from it. The functions defined in the code check whether the HTML file contains certain elements or attributes, such as a title, input fields, buttons, images, links, passwords, email inputs, hidden elements, audio and video elements, and more.
The code extracts several features from the HTML source code of a website to determine whether
it is a phishing website or not. Here is a brief explanation of how the code extracts some of the
features:
URL length: The code extracts the length of the URL of the website, which is an important feature
because many phishing websites have unusually long URLs.
Number of dots in URL: The code counts the number of dots in the URL of the website, which is
also an important feature because many phishing websites have URLs that contain multiple dots.
Presence of @ symbol in URL: The code checks whether the URL of the website contains the '@'
symbol, which is rarely used in legitimate URLs but is commonly used in phishing URLs.
Presence of a redirecting link: The code checks whether the website has a redirecting link, which
is a common technique used by phishing websites to redirect users to another website where they
can steal their personal information.
Use of iFrames: The code checks whether the website uses iFrames, which are commonly used in
phishing websites to display fake login forms or other fake content.
Presence of a favicon: The code checks whether the website has a favicon, which is a small icon
that appears in the browser's address bar or tab. Phishing websites may not have a favicon or may
use a fake one.
By extracting these features from the HTML source code of a website, the code can identify whether the website is likely to be a phishing website or not.
Overall, this code defines a series of functions to extract features from HTML files and creates a
2D array of these features for a set of HTML files in a directory. The extracted features include
information such as whether the HTML file has a title, input, button, image, link, password, audio,
video, footer, form, text area, iframe, text input, navigation, picture, and table, as well as the
number of inputs, buttons, images, paragraphs, scripts, and links, the length of the title and text,
and the number of various HTML elements such as TH, TR, H1, H2, H3, A, IMG, DIV, FIGURE,
META, SPAN, and sources. The code also includes commented-out code to create a Pandas
dataframe from the 2D array.
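As an illustration of the extraction just described, a much-reduced create_vector might look like this (a hedged sketch; the real features.py extracts far more features than the handful shown here):

```python
from bs4 import BeautifulSoup

# Hedged, much-reduced sketch of the HTML feature extraction described
# above; the project's actual create_vector covers many more elements.

def create_vector(html: str) -> list:
    soup = BeautifulSoup(html, "html.parser")
    return [
        int(soup.find("title") is not None),   # page has a <title>?
        int(soup.find("iframe") is not None),  # page uses iframes?
        len(soup.find_all("input")),           # number of input fields
        len(soup.find_all("a")),               # number of links
        len(soup.find_all("img")),             # number of images
        len(soup.get_text()),                  # length of the visible text
    ]

html = "<html><head><title>Login</title></head><body><input><input><a href='#'>x</a></body></html>"
vector = create_vector(html)
```

Running create_vector over every saved HTML file yields the 2D feature array mentioned above.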
Coming to the data_collection.py code: in this part, a CSV file named "verified_online_2.csv" is read using Pandas, and the URLs are extracted from the 'url' column and stored in a list called "URL_list". A function called "create_structured_data" then takes a list of URLs as input and returns a list of vectors (structured data) extracted from the HTML content of each URL. For each URL in the list, the function makes an HTTP request to get the HTML content of that URL. If the response code is 200 (i.e., the connection is successful), the function uses the BeautifulSoup library to extract specific features from the HTML content using the "create_vector" function from the "feature_extraction" module. The extracted features are appended to a list called "vector", which also includes the URL. This list is then appended to another list called "data_list", which is returned by the function. If the HTTP connection is not successful, the function prints a message and continues to the next URL in the list.
The final output of this code is a CSV file named "structured_data_phishing_2.csv", which contains
structured data extracted from a list of URLs, along with a label column indicating whether the
website is a phishing website or not (in this case, all labels are set to 1, indicating phishing
websites).
The structured data includes various features such as the presence of certain HTML elements (e.g.,
title, input, button), the number of such elements, the length of the title, and so on. These features
can be used as input to a machine learning model to classify websites as phishing or legitimate.
The code reads a CSV file containing a list of URLs, selects a subset of URLs, scrapes the content
of each URL, extracts the features, and saves the data to the CSV file.
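The scraping loop described above can be sketched as follows. To keep the example self-contained and testable without network access, the HTTP fetch and the vectoriser are injected as functions; in the real data_collection.py, the fetch would use requests.get and the vectoriser would be feature_extraction.create_vector:

```python
# Hedged sketch of the collection loop described above (not the project's
# exact code). The fetch and vectorize callables are stand-ins: in
# data_collection.py the fetch is a requests.get call and the vectoriser
# is feature_extraction.create_vector.

def create_structured_data(url_list, fetch, vectorize):
    data_list = []
    for url in url_list:
        status, html = fetch(url)      # e.g. requests.get(url) -> status, body
        if status != 200:              # connection failed: report and move on
            print(f"Could not reach {url}, skipping")
            continue
        vector = vectorize(html)       # HTML content -> feature vector
        vector.append(url)             # keep the URL alongside its features
        data_list.append(vector)
    return data_list

# Stub fetcher and vectoriser standing in for real network access.
rows = create_structured_data(
    ["http://a.example", "http://b.example"],
    lambda u: (200, "<html></html>") if u == "http://a.example" else (404, ""),
    lambda h: [1, 2],
)
```

Unreachable URLs are simply skipped, so one dead phishing page does not abort the whole collection run.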
2.2 BLOCK DIAGRAM
The block diagram of the Phishing Website Tracking system is shown below. It gives an overall view of the system. The blocks connected here represent the methods used in this project.
As our project evolves, we will continually refine feature extraction techniques to remain
at the forefront of web security. By combining advanced data analysis with cutting-edge machine
learning algorithms, our goal is to provide a robust and reliable tool for website safety assessment.
Our vision extends beyond individual users; we aim to create a safer digital ecosystem where
systems, organizations, and individuals can trust their online experiences, knowing that our
platform is dedicated to preserving their security and privacy. We are excited to embark on this
journey toward a safer digital future, fostering trust and security in the evolving landscape of the
internet.
CHAPTER 3: PROCESS & REQUIREMENTS
3.1 SOFTWARE REQUIREMENTS
3.1.1 Visual Studio Code
Visual Studio Code (VS Code) is a highly regarded, lightweight source code editor
developed by Microsoft. It's designed to provide a powerful coding experience while remaining
quick and efficient, making it a popular choice for developers on Windows, macOS, and Linux.
What sets VS Code apart is its extensibility, with a vast ecosystem of extensions available through
the Visual Studio Code Marketplace. These extensions cater to a wide range of programming
languages and tools, allowing developers to customize the editor to their specific needs. With built-
in support for Git version control, intelligent code editing features, debugging capabilities,
integrated terminals, and task automation, VS Code offers a comprehensive development
environment. Its user-friendly interface, themes, and a supportive community make it an excellent
choice for coding and collaboration across various projects and languages. Whether you're a
beginner or an experienced developer, Visual Studio Code can enhance your coding productivity
and streamline your development workflow.
3.1.2 Python IDLE
IDLE is the integrated development and learning environment that ships with the standard Python distribution, offering an interactive shell and a basic editor. Although more specialized IDEs exist, IDLE's simplicity and accessibility make it a favored environment for many Python programming tasks.
3.1.3 Streamlit
Streamlit is a Python library designed for effortless web app creation, particularly in data
science and machine learning. With Streamlit, you can turn data scripts into interactive web
applications using only a few lines of code. It's an excellent choice for rapid prototyping, enabling
quick iteration on data analysis and visualizations. The library supports a wide range of data
visualization tools and widgets for user interaction. Sharing apps is simple, and Streamlit's
ecosystem includes extensions and deployment options like Streamlit Sharing. This combination
of simplicity and flexibility makes Streamlit a popular choice for data scientists and developers
looking to showcase their work or deploy machine learning models as user-friendly web
applications.
CHAPTER 4: RESULTS & CONCLUSIONS
4.1 Result
We simply need to enter the URL of the target website.
Here we have typed http://www.vnrvjiet.ac.in/, which is not a phishing site, so green bubbles appear, indicating that it is safe.
Fig-4.1.1: Secure Website. On being given a website link, the system determined that the given URL is a legitimate one. For classifying URLs, any of the machine learning algorithms in the drop-down menu, such as KNN, random forest, or decision tree, can be used.
Next, we enter a phishing site: http://loteriada-urodzinowa.pl/millenium_payment.php
Note the message and warning sign, since the entered site is a phishing site.
Fig-4.1.2: Potential phishing website. On being given a website link, the system determined that the given URL is a potential phishing website. For classifying URLs, any of the machine learning algorithms in the drop-down menu, such as KNN, random forest, or decision tree, can be used.
Website link: https://pranaydavath-phishing-website-tracker-app-785uin.streamlit.app/
4.2 Conclusion
Our project is the culmination of dedicated efforts to fortify the digital realm and safeguard
it from the perils of web phishing attacks. It stands as a testament to our commitment to secure the
online experiences of users, uphold the integrity of organizations, and ensure the collective safety
of society. With machine learning as our ally and awareness as our weapon, we are not merely
concluding a project; we are launching a transformative movement.
As we venture into this new era, we envision a digital landscape where uncertainty and
apprehension give way to confidence and trust. This isn't just the end of a project; it's the beginning
of a more secure and resilient online world. Our mission is to empower individuals, organizations,
and society with the knowledge and tools to discern authentic websites from deceptive ones. We
are taking the first steps towards a future where every online interaction is characterized by safety,
where digital vulnerabilities are replaced by robust defenses, and where the harmony of the online
ecosystem prevails.
Our project's last words mark a new chapter in the ever-evolving narrative of web security.
They signify a resolute commitment to building a digital world where users can navigate the
internet with confidence, where organizations can conduct business without trepidation, and where
the collective safety of society is upheld. In these concluding moments, we are setting the course
for a more secure, resilient, and trustworthy online future.
4.3 Future Scope
Looking ahead, the field is expected to see enhanced detection capabilities, behavioural analysis to monitor user interactions with websites, increased reliance on transfer learning, and a proactive approach to counter adversarial attacks. Multi-modal
data analysis, integrating user education, and encouraging collaboration and data sharing among
organizations for collective defense are expected to be key components. The integration of
phishing detection with other security tools, regulatory considerations, and adaptations for mobile
and IoT security will further shape the landscape of this evolving field, reinforcing the critical role
that machine learning plays in combating phishing threats.
CHAPTER 5: REFERENCES
[1] Gunter Ollmann, "The Phishing Guide: Understanding & Preventing Phishing Attacks", Internet Security Systems, 2007.
[2] Mahmoud Khonji, Youssef Iraqi, and Andrew Jones, "Phishing Detection: A Literature Survey", IEEE Communications Surveys & Tutorials, 2013.
[3] Mohammad R., Thabtah F., McCluskey L., "Phishing Websites Dataset", 2015. Available: https://archive.ics.uci.edu/ml/datasets/Phishing+Websites. Accessed January 2016.
[4] Purbay M., Kumar D., "Split Behavior of Supervised Machine Learning Algorithms for Phishing URL Detection", Lecture Notes in Electrical Engineering, vol. 683, 2021, doi: 10.1007/978-981-15-6840-4_40.
[5] Jain A.K., Gupta B.B., "PHISH-SAFE: URL Features-Based Phishing Detection System Using Machine Learning", Cyber Security, Advances in Intelligent Systems and Computing, vol. 729, 2018, doi: 10.1007/978-981-10-8536-9_44.
[6] Gandotra E., Gupta D., "An Efficient Approach for Phishing Detection using Machine Learning", Algorithms for Intelligent Systems, Springer, Singapore, 2021, doi: 10.1007/978-981-15-8711-5_12.
[7] R. S. Rao and S. T. Ali, "PhishShield: A Desktop Application to Detect Phishing Webpages through Heuristic Approach", Procedia Computer Science, vol. 54, pp. 147-156, 2015.
[8] Y. Zhang, J. I. Hong, and L. F. Cranor, "Cantina: A Content-based Approach to Detecting Phishing Web Sites", New York, NY, USA, 2007, pp. 639-648.
[9] Hassan Y. A. Abutair and Abdelfettah Belghith, "Using Case-Based Reasoning for Phishing Detection", in The 8th International Conference on Ambient Systems, Networks and Technologies (ANT), 2017, pp. 281-288.
[10] L. Breiman, "Random Forests", Machine Learning, vol. 45, no. 1, pp. 5-32, Oct. 2001.
[11] Sirageldin, B. B. Baharudin, and L. T. Jung, "Malicious Web Page Detection: A Machine Learning Approach", in Advances in Computer Science and its Applications, Springer, Berlin, Heidelberg, 2014, pp. 217-22.
APPENDIX
First the data collection part includes the following lines of code
Moving on to the features part, we have the following features for classification.
Let us now look at the training of the model
Libraries Required
Training datasets
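A hedged sketch of what this training step might look like with scikit-learn (a tiny synthetic feature matrix stands in for the structured CSVs built earlier, and random forest is one of the algorithms mentioned in Chapter 4; this is not the project's exact code):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# In the project, X and y would be loaded from the structured CSVs;
# here a synthetic feature matrix keeps the sketch self-contained.
rng = np.random.default_rng(42)
X = rng.integers(0, 50, size=(200, 6))   # 200 sites, 6 numeric features each
y = (X[:, 0] > 25).astype(int)           # stand-in phishing label

# Hold out 20% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
preds = model.predict(X_test)
```

Swapping RandomForestClassifier for KNeighborsClassifier or DecisionTreeClassifier gives the other models offered in the app's drop-down menu.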
For predicting accuracy and precision
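Accuracy and precision (along with recall) can be computed with scikit-learn's metrics module; the label vectors below are illustrative stand-ins rather than the project's actual test results:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Illustrative evaluation step; y_true/y_pred are stand-in label vectors,
# not results from the project's actual test set.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # ground-truth labels (1 = phishing)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # labels predicted by the model

accuracy = accuracy_score(y_true, y_pred)    # fraction classified correctly
precision = precision_score(y_true, y_pred)  # of sites flagged phishing, how many really are
recall = recall_score(y_true, y_pred)        # of real phishing sites, how many were flagged
```

Precision matters here because false alarms erode user trust, while recall measures how many phishing sites slip through undetected.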