
Detecting Phishing Domains Using Deep Learning

BMS INSTITUTE OF TECHNOLOGY AND MANAGEMENT


(An Autonomous Institute affiliated to VTU, Belagavi, Approved by AICTE New Delhi)
Yelahanka, Bengaluru 560064

Department of Computer Science and Engineering

Synopsis for the Project work

“Detecting Phishing Domains Using Deep Learning”

Submitted By:

1. HASAN EBRAHIM HADDAD BIN SUMAIT 1BY20CS060

2. SAEED AHMED SAEED ALOJILY 1BY20CS159

3. S MOHAMMED SUHAIL 1BY19CS127

4. SHREYAS POOJARI 1BY21CS414

Under the Guidance of

Dr. ARCHANA R A

BMSIT&M, Department of CSE 2024-2023



ABSTRACT

The expanding digital landscape has brought increasingly sophisticated cybersecurity threats,
with phishing attacks at the forefront. A phishing attack occurs when a malicious actor builds a
replica of a legitimate website in order to covertly extract confidential user data. Such attacks
undermine digital trust and can cause significant financial and personal losses. Traditional
countermeasures center on static heuristics, regularly updated blacklists, and basic pattern
recognition. These approaches have clear shortcomings: they are reactive, are ill-equipped to
detect newly crafted phishing techniques, and can produce a considerable number of false negatives,
leaving users unwittingly exposed. To address these weaknesses, our research proposes the use of
deep learning techniques, specifically the combined use of Convolutional Neural Networks (CNN)
and Recurrent Neural Networks (RNN). These architectures are well suited to discerning intricate
patterns within data sequences, making them apt candidates for analyzing the structure of
Uniform Resource Locators (URLs). By harnessing them, our research aims to move beyond
conventional phishing detection and build a model rooted in adaptability, precision, and
real-time detection. This research thus contributes to the academic discourse on cybersecurity
and offers practical tools for improving phishing detection in a constantly evolving cyber
ecosystem.


INTRODUCTION
Cybersecurity, long a cornerstone of digital evolution, is persistently challenged by the
changing faces of online threats. Among these, phishing has emerged as a particularly tenacious
adversary. At its core, phishing is a digital masquerade: attackers meticulously design web
replicas that mirror the look and functionality of their legitimate originals. Such replicas,
while appearing benign, harbor a malicious intent: to trick users into divulging sensitive data,
be it financial credentials, personal information, or secure access codes. The implications of
successful phishing attacks are profound, ranging from individual financial losses to wide-scale
breaches of institutional databases, ultimately eroding the digital trust that modern online
ecosystems rely upon.

Existing countermeasures, though diverse, predominantly hinge on static strategies, from
predefined heuristic rules to periodically updated blacklists. While these methods provide a
measure of defense, they are inherently reactive: because they respond only after a threat has
been observed, they are often obsolete when confronted with novel phishing ploys. This latency in
threat detection and response underscores a pressing need for more dynamic, adaptive, and
anticipatory solutions.

Recognizing this need, our research investigates deep learning techniques and their potential
to improve phishing detection. By leveraging neural architectures such as CNNs and RNNs, known
for their capability to discern subtle patterns in sequential data, we propose a framework for
analyzing URLs and their embedded characteristics. The objective is to move from traditional,
often belated, reactive models to a more preemptive and adaptive approach that can keep pace
with the changing phishing landscape.


LITERATURE SURVEY
Traditional Machine Learning in Phishing Detection:
Machine learning has been a topic of interest in phishing detection for the past decade.
Early works focused on supervised learning algorithms to distinguish between phishing and
legitimate websites. Researchers like Ma et al. (2010) explored the use of decision trees, SVMs,
and naive Bayes classifiers, relying on features like the URL's length, age of the domain, and the
presence of certain keywords. These models showcased decent accuracy rates, but their
performance was limited by the static nature of the features used.

Feature Engineering for Enhanced Detection:


Delving deeper into feature engineering, studies like that by Gupta et al. (2013) suggested
that the richness of the features plays a crucial role in the model's efficacy. The researchers
experimented with features beyond just the URL structure—such as SSL certificate status, domain
registrar details, and webpage content analysis. By integrating a more comprehensive set of
features, these studies aimed to create a more holistic understanding of what constitutes a phishing
site, thereby improving detection rates.

Deep Learning's Foray into Phishing Detection:


The application of deep learning in phishing detection is relatively recent. Lee et al. (2017)
were among the pioneers who recognized the potential of neural networks in this domain. They
argued that while traditional machine learning models were adept at pattern recognition, they often
missed out on intricate patterns and inter-feature relationships that deep learning models, like
CNNs, could capture. Their work demonstrated that CNNs, when trained on raw URL sequences,
could effectively detect disguised malicious patterns within URLs.


Neural Architectures Beyond CNNs:


With the success of CNNs, there was a natural progression towards exploring other neural
architectures, like RNNs. Zhang et al. (2019) employed RNNs, especially LSTMs, emphasizing
their capability to process URLs as sequential data. Given that URLs have inherent sequential
patterns, RNNs could capture long-term dependencies between different segments of a URL,
potentially offering better insights into its malicious nature.

Comparative Studies and Hybrid Approaches:


Finally, the literature includes many comparative studies that benchmark the
performance of various models. Kim and Choi (2020), for instance, compared CNNs, RNNs, and
traditional machine learning techniques, concluding that while deep learning models generally
outperformed their traditional counterparts, there was a case to be made for hybrid models. These
models combine the strengths of both CNNs and RNNs, aiming to capture both the sequential
nature and the local patterns of URLs.


LIMITATIONS OF EXISTING SYSTEMS


Static Nature and Lack of Adaptability:
One of the most salient drawbacks of existing phishing detection systems is their static
nature. Many traditional systems, particularly those relying on blacklists or predefined rules,
operate on a fixed dataset or a fixed set of patterns. As a result, they are limited to detecting
known phishing URLs or patterns. In the rapidly evolving landscape of phishing strategies, where
attackers constantly change their tactics, these static models fail to recognize new and emerging
threats, leaving significant gaps in protection.

High False Positives and Negatives:


Another challenging issue is the balance between false positives and false negatives. Rule-
based systems, in their attempt to cast a wide net, often misclassify legitimate sites as phishing
sites (false positives). Conversely, they may also overlook subtle, novel phishing tactics, allowing
malicious sites to go undetected (false negatives). These misclassifications not only endanger
unsuspecting users but can also erode trust in the detection system itself, especially if genuine
websites are frequently flagged.

Latency in Updates:
Systems that rely on regularly updated databases or blacklists face a latency challenge. The
time interval between the emergence of a new phishing URL and its addition to the blacklist can
be a window of opportunity for attackers. During this period, unsuspecting users remain exposed
to the new threat, highlighting the system's reactive rather than proactive nature.
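
The latency problem can be illustrated with a minimal sketch of an exact-match blacklist. The blacklist entries and the two URLs below are made-up examples, and `is_blacklisted` is a hypothetical helper, not part of any existing system described here.

```python
from urllib.parse import urlsplit

# Hypothetical blacklist snapshot; the entries are made-up examples.
BLACKLIST = {"secure-login.example-bank.test", "paypa1-verify.test"}


def is_blacklisted(url: str) -> bool:
    """Return True only when the URL's host is an exact blacklist entry."""
    host = urlsplit(url).hostname or ""
    return host.lower() in BLACKLIST


known = "http://paypa1-verify.test/signin"
fresh = "http://paypa1-verify2.test/signin"  # newly registered variant

print(is_blacklisted(known))  # True
print(is_blacklisted(fresh))  # False
```

Until the second host is added to the list, the lookup treats it as safe, which is the window of exposure described above.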


Scalability Concerns:

With the sheer volume of websites and URLs coming online daily, scalability
becomes a pressing concern. Traditional systems, especially those requiring manual
updates or human intervention, struggle to scale with the explosive growth of the
internet. This scalability challenge means that as the web expands, the system's
efficiency and coverage can decrease.

Over-reliance on Surface-level Features:


Many conventional phishing detection methods focus on easily discernible, surface-level
features of websites, such as URL length or specific keyword presence. While these features offer
some level of detection accuracy, they barely scratch the surface of the complex web of indicators
that can hint at a website's malicious intent. This superficial analysis can be easily circumvented
by sophisticated attackers, further weakening the system's efficacy.


SYSTEM REQUIREMENT SPECIFICATIONS

Functional Requirements:

1. URL Processing Capability:


The system must be adept at ingesting and analyzing URLs, regardless of their length,
structure, or origin. It should be able to dissect URLs into their constituent components for feature
extraction.
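
As a rough illustration of this requirement, a URL can be split into its components with Python's standard `urllib.parse` module. The function name `dissect_url` and the chosen output fields are assumptions made for this sketch.

```python
from urllib.parse import urlsplit, parse_qs


def dissect_url(url: str) -> dict:
    """Break a URL into the components that later stages could inspect."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    return {
        "scheme": parts.scheme,
        "host": host,
        "labels": host.split(".") if host else [],  # dot-separated host labels
        "port": parts.port,                          # None when no port is given
        "path_segments": [s for s in parts.path.split("/") if s],
        "query": parse_qs(parts.query),              # parameter name -> list of values
        "fragment": parts.fragment,
    }


example = dissect_url("https://login.example.com:8443/account/verify?id=42&next=home#top")
print(example["labels"])         # ['login', 'example', 'com']
print(example["path_segments"])  # ['account', 'verify']
```

Note that `urlsplit` only recognizes the host when the URL carries a scheme (or starts with `//`), so scheme-less inputs would need to be normalized first.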

2. Phishing Detection Mechanism:


At the core of the system, there should be an effective mechanism to discern and flag
potential phishing URLs. This mechanism should utilize deep learning models to evaluate the
legitimacy of a website based on the extracted features from its URL.

3. Continuous Learning Module:


The system must have the capability to learn continuously from new data, ensuring it
remains updated in the face of evolving phishing strategies. This involves retraining or fine-tuning
the underlying models with new data samples.
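
The idea of incremental updating can be sketched with a much simpler stand-in than the deep models proposed here: a logistic regression classifier updated by one stochastic gradient step per new labelled sample. The class name `OnlineLogistic`, the learning rate, and the two-element feature vector are all assumptions for illustration; fine-tuning a CNN or RNN would follow the same pattern of repeated small updates but with a deep learning framework.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


class OnlineLogistic:
    """Stand-in model: logistic regression trained one sample at a time."""

    def __init__(self, n_features: int, lr: float = 0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict_proba(self, x):
        return sigmoid(self.b + sum(wi * xi for wi, xi in zip(self.w, x)))

    def update(self, x, y):
        """One SGD step on the log loss for a single sample (y is 0 or 1)."""
        error = self.predict_proba(x) - y
        self.w = [wi - self.lr * error * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * error


model = OnlineLogistic(n_features=2)
sample, label = [1.0, 0.5], 1          # hypothetical feature vector of a new phishing URL
before = model.predict_proba(sample)   # 0.5 for the untrained model
for _ in range(20):
    model.update(sample, label)
after = model.predict_proba(sample)    # moves toward 1 as the sample is learned
```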

4. User Feedback Integration:


To bolster the system's accuracy, a mechanism should be in place to collect and integrate
user feedback. If a user disputes a detection result, this feedback should be processed and utilized
to refine the model.

5. Reporting & Alert System:


Once a potential phishing URL is detected, the system must be equipped to notify the
concerned parties or stakeholders, be it end-users or system administrators. This alert mechanism
should be both timely and informative.


Non-Functional Requirements:

1. Accuracy & Precision:


The system should maintain a high level of accuracy in its detection capabilities,
minimizing both false positives and negatives. This ensures that legitimate sites are not wrongfully
flagged and malicious sites are not overlooked.

2. Real-time Processing:
In the digital age, delays can be costly. The system must be designed to operate in real-
time or near-real-time, swiftly processing URLs and generating detection results without
noticeable lag.

3. Scalability:
With the vast expanse of the internet and the constant influx of new websites, the system
must be scalable. It should be capable of handling large datasets, sudden spikes in traffic, and
expanding its operations as required without compromising performance.

4. Robustness & Resilience:


The system should be robust enough to handle various challenges, from unexpected data
formats to potential cyber-attacks on the system itself. It should also be resilient, ensuring
consistent operation even under adverse conditions.

5. Data Security & Privacy:


Given that the system deals with URLs and potentially sensitive data, it should incorporate
stringent data security and privacy measures. This includes encryption, data anonymization, and
adherence to global data protection standards.


PROPOSED METHODOLOGY

1. Data Collection and Preprocessing:


The initial step involves aggregating a comprehensive dataset containing both legitimate
and phishing URLs. These will be obtained from multiple cybersecurity repositories, academic
publications, and real-time web crawlers. Once the data is compiled, preprocessing will be initiated
to clean and format the URLs, ensuring the dataset is conducive to deep learning applications.
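
A minimal sketch of the cleaning step, assuming the raw data arrives as `(url, label)` pairs with label 1 for phishing and 0 for legitimate. The helper names and the rule that scheme-less entries are treated as HTTP are assumptions, not details taken from the project.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(raw: str):
    """Return a cleaned URL string, or None if the input is unusable."""
    url = raw.strip()
    if not url:
        return None
    if "://" not in url:
        url = "http://" + url  # assumption: scheme-less entries are HTTP
    try:
        parts = urlsplit(url)
    except ValueError:         # e.g. malformed IPv6 host
        return None
    if not parts.hostname:
        return None
    # Lower-casing the whole netloc is a simplification: it would also
    # lower-case any user:password part. Path and query are left untouched,
    # and the fragment is dropped.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), parts.path, parts.query, ""))


def build_dataset(rows):
    """Clean (url, label) pairs and drop duplicates, keeping the first label seen."""
    seen = {}
    for raw, label in rows:
        url = normalize_url(raw)
        if url is not None and url not in seen:
            seen[url] = label
    return list(seen.items())


rows = [
    ("  HTTP://Example.com/Login ", 0),
    ("http://example.com/Login", 0),   # duplicate after normalization
    ("phish.test/verify#frag", 1),
    ("", 1),                           # empty entry, dropped
]
dataset = build_dataset(rows)
print(dataset)
# [('http://example.com/Login', 0), ('http://phish.test/verify', 1)]
```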

2. Feature Extraction:
The essence of our methodology lies in converting raw URLs into analyzable features. This
involves breaking down the URLs to capture attributes such as their structural patterns, domain
details, and other latent indicators that may hint at the nature of the website.
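
The sketch below shows one way such attributes might be computed from a raw URL. The specific feature set (lengths, counts, an IPv4-host check, host entropy) is an assumption chosen for illustration; the synopsis does not fix a feature list.

```python
import math
import re
from collections import Counter
from urllib.parse import urlsplit

IPV4_RE = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def shannon_entropy(text: str) -> float:
    """Entropy in bits per character; higher values suggest random-looking strings."""
    if not text:
        return 0.0
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in Counter(text).values())


def extract_features(url: str) -> dict:
    parts = urlsplit(url)
    host = parts.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots_in_host": host.count("."),
        "num_hyphens_in_host": host.count("-"),
        "num_digits": sum(ch.isdigit() for ch in url),
        "has_at_symbol": "@" in url,
        "host_is_ipv4": bool(IPV4_RE.match(host)),
        "uses_https": parts.scheme == "https",
        "path_depth": len([s for s in parts.path.split("/") if s]),
        "host_entropy": shannon_entropy(host),
    }


features = extract_features("http://192.168.0.1/secure/login.php?user=1")
print(features["host_is_ipv4"], features["path_depth"])  # True 2
```

Hand-crafted features like these can complement the character-level input that the deep models consume directly.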

3. Model Development:
With the features ready, we will venture into the core of our methodology: the development
of deep learning models. Three distinct architectures will be crafted:

- Convolutional Neural Network (CNN): Primarily focusing on detecting patterns within the
URL structure.

- Recurrent Neural Network (RNN): This model will emphasize capturing the sequential order
and relationships within the URL.

- Hybrid Model (CNN-RNN): A combination of CNN and RNN, this model will aim to harness
the strengths of both architectures, capturing patterns while considering sequential relationships.
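
To make the data flow of the hybrid architecture concrete, here is a dependency-free sketch of a forward pass: characters are mapped to ids, embedded, passed through a 1D convolution with ReLU, and the resulting sequence is fed to a simple (Elman) recurrent layer whose final state is turned into a score. All sizes, the vocabulary, and the class name `HybridSketch` are assumptions, and the weights are random and untrained, so the output is not a meaningful prediction. A real implementation would use a deep learning framework and would likely replace the simple recurrence with an LSTM.

```python
import math
import random

VOCAB = "abcdefghijklmnopqrstuvwxyz0123456789-._~:/?#[]@!$&'()*+,;=%"
CHAR_TO_ID = {ch: i + 1 for i, ch in enumerate(VOCAB)}  # id 0 = padding or unknown
MAX_LEN, EMB, FILTERS, KERNEL, HIDDEN = 32, 4, 3, 3, 5  # illustrative sizes


def encode(url):
    """Map a URL to a fixed-length list of character ids (truncate, then pad)."""
    ids = [CHAR_TO_ID.get(ch, 0) for ch in url.lower()[:MAX_LEN]]
    return ids + [0] * (MAX_LEN - len(ids))


def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class HybridSketch:
    def __init__(self, seed=0):
        rng = random.Random(seed)
        self.embedding = rand_matrix(rng, len(VOCAB) + 1, EMB)
        # One weight vector of length KERNEL * EMB per convolution filter.
        self.conv = rand_matrix(rng, FILTERS, KERNEL * EMB)
        self.w_in = rand_matrix(rng, HIDDEN, FILTERS)
        self.w_rec = rand_matrix(rng, HIDDEN, HIDDEN)
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(HIDDEN)]

    def conv_layer(self, ids):
        """CNN part: slide each filter over windows of KERNEL embedded characters."""
        embedded = [self.embedding[i] for i in ids]
        out = []
        for t in range(len(embedded) - KERNEL + 1):
            window = [v for vec in embedded[t:t + KERNEL] for v in vec]
            out.append([max(0.0, dot(f, window)) for f in self.conv])  # ReLU
        return out

    def rnn_layer(self, seq):
        """RNN part: carry a hidden state across the convolved sequence."""
        h = [0.0] * HIDDEN
        for x in seq:
            h = [math.tanh(dot(self.w_in[j], x) + dot(self.w_rec[j], h))
                 for j in range(HIDDEN)]
        return h

    def predict(self, url):
        """Sigmoid of a linear readout of the final hidden state."""
        h = self.rnn_layer(self.conv_layer(encode(url)))
        return 1.0 / (1.0 + math.exp(-dot(self.w_out, h)))


model = HybridSketch()
score = model.predict("http://example.com/login")  # value in (0, 1), untrained
```

Using only `conv_layer` followed by pooling would correspond to the CNN variant, and applying `rnn_layer` directly to the embedded characters would correspond to the RNN variant.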

4. Validation and Testing:


To prevent overfitting and ensure the robustness of our models, we will employ a rigorous
validation and testing regimen. The dataset will be split into training, validation, and testing sets.
The validation set will be instrumental in tuning the model parameters, while the testing set will
gauge the final model's performance, ensuring its accuracy and generalizability.
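
One possible way to produce the three sets is a stratified split, which keeps the share of phishing and legitimate URLs roughly equal across them. The 70/15/15 proportions, the seed, and the function name are assumptions; the synopsis does not specify split sizes.

```python
import random
from collections import defaultdict


def stratified_split(samples, train=0.7, val=0.15, seed=42):
    """Split (url, label) pairs into train/val/test, preserving label ratios."""
    by_label = defaultdict(list)
    for item in samples:
        by_label[item[1]].append(item)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    splits = {"train": [], "val": [], "test": []}
    for items in by_label.values():
        rng.shuffle(items)
        n_train = round(len(items) * train)
        n_val = round(len(items) * val)
        splits["train"] += items[:n_train]
        splits["val"] += items[n_train:n_train + n_val]
        splits["test"] += items[n_train + n_val:]  # remainder goes to test
    return splits


# Synthetic placeholder data: 100 legitimate (0) and 100 phishing (1) URLs.
samples = [(f"http://site{i}.test/", i % 2) for i in range(200)]
splits = stratified_split(samples)
print({name: len(part) for name, part in splits.items()})
# {'train': 140, 'val': 30, 'test': 30}
```

The test set should be touched only once, after all tuning on the validation set is finished, so that its score remains an unbiased estimate.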


5. Model Comparison:
Once training and testing are complete, the models will be compared in detail. Each
of the three models - CNN, RNN, and the hybrid CNN-RNN - will be evaluated against the
others in terms of detection accuracy, precision, recall, and F1-score. This comparative
analysis will guide the selection of the best-performing model for deployment.
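
The four metrics follow from the confusion-matrix counts, treating phishing as the positive class: precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 is their harmonic mean. The sketch below computes them for a small made-up set of labels and predictions; the function name is an assumption.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 with `positive` as the phishing class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up example: 4 phishing and 6 legitimate URLs, one miss and one false alarm.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
# {'accuracy': 0.8, 'precision': 0.75, 'recall': 0.75, 'f1': 0.75}
```

Running the same function on each model's test-set predictions would give directly comparable numbers for the three architectures.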

6. Deployment as a Web Application:


The culmination of our methodology will see the best-performing model being transitioned
into a fully functional web application. This application will allow users to input URLs and receive
real-time evaluations of their legitimacy. By translating our deep learning models into an
accessible and user-friendly platform, we aim to provide a tangible solution to the persistent
threat of phishing.
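
A minimal sketch of such a web endpoint, using only the WSGI support in Python's standard library. The `score_url` function is a hypothetical placeholder rule standing in for the trained model, and the `url` query parameter, JSON field names, port, and 0.5 threshold are assumptions; a production deployment would more likely use a web framework.

```python
import json
from urllib.parse import parse_qs
from wsgiref.simple_server import make_server


def score_url(url: str) -> float:
    """Placeholder scorer (hypothetical); a trained model would be called here."""
    return 0.9 if "@" in url or url.count("-") > 3 else 0.1


def respond(start_response, status: str, payload: dict):
    body = json.dumps(payload).encode("utf-8")
    start_response(status, [("Content-Type", "application/json"),
                            ("Content-Length", str(len(body)))])
    return [body]


def app(environ, start_response):
    """WSGI application: GET /?url=<encoded URL> returns a JSON verdict."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    url = query.get("url", [""])[0]
    if not url:
        return respond(start_response, "400 Bad Request",
                       {"error": "missing 'url' parameter"})
    score = score_url(url)
    return respond(start_response, "200 OK",
                   {"url": url, "phishing_score": score, "flagged": score >= 0.5})


def serve(host: str = "127.0.0.1", port: int = 8000):
    """Start a local development server (blocks until interrupted)."""
    with make_server(host, port, app) as server:
        server.serve_forever()
```

Calling `serve()` starts the server locally; a request such as `/?url=http%3A%2F%2Fexample.com%2F` then returns the score and the flag as JSON.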


PROPOSED SYSTEM OVERVIEW BASED ON THE WORKFLOW DIAGRAM

The proposed system, as visualized in the provided workflow diagram, follows a structured
and streamlined process to ensure the optimal detection of phishing attempts. Here's a detailed
breakdown:

Figure 1: System Structure

Figure 2: Convolutional neural networks (CNN)


Figure 3: Recurrent neural network (RNN)

1. Data Collection:
This is the initial step in the system, where data pertinent to the problem domain is gathered.
This could be URLs, website content, or any other data type relevant to phishing detection. The
quality and diversity of the collected data play a critical role in the subsequent stages and in
the system's overall efficacy.

2. Data Preprocessing:
Following collection, the data undergoes preprocessing to ensure it's in the right format and
free from inconsistencies or noise. Tasks such as normalization, handling missing values, and
removing outliers are performed in this step, ensuring that the data is primed for feature extraction.

3. Feature Extraction:
After preprocessing, meaningful features are extracted from the cleaned data for use in the
model training phase. This step transforms the raw data into an organized set of variables
(features) that are more informative and relevant to the detection task.


4. Model Training:
With the features in hand, the system delves into the core task of model training. Here, the
chosen algorithms (like CNN, RNN, or their combination) are taught to distinguish between
legitimate and phishing data points using the extracted features. The models learn the intricate
patterns and nuances, refining their internal parameters for better accuracy.

5. Testing and Validation:


After training, the models are rigorously evaluated using unseen data. This step is vital to
ensure that the models don't just memorize the training data (overfitting) but generalize well to
new, unseen instances. Through testing and validation, the system can gauge the performance
metrics of each model and their efficacy in real-world scenarios.

6. Deployment:
Once a model (or models) has been validated and deemed satisfactory, the final step is its
deployment as a web application. In this phase, the model becomes accessible to end-users, where
they can input data (e.g., URLs) and receive instantaneous feedback on whether it's a potential
phishing attempt or a legitimate entity.

By systematically following this workflow, the proposed system ensures a robust and reliable
approach to phishing detection, emphasizing data quality, model efficiency, and user accessibility.


REFERENCES

Research Papers:

a. Basit, A.; Zafar, M.; Liu, X.; Javed, A. R.; Jalil, Z.; Kifayat, K. A comprehensive survey
of AI-enabled phishing attacks detection techniques.

b. Tang, L.; Mahmoud, Q. H. A Deep Learning-Based Framework for Phishing Website Detection.

c. Tang, L.; Yue, T. X. (2018). Machine learning for water consumption prediction in smart
homes. Water Resources Research, 54(3), 1602-1619.

d. Alnemari, S.; Alshammari, M. Detecting Phishing Domains Using Machine Learning.
Appl. Sci. 2023, 13, 4649. https://doi.org/10.3390/app13084649
