Phishing Detection System Through Hybrid

This document discusses a proposed hybrid machine learning model for phishing detection based on URL attributes. It begins with an abstract that outlines the goals of detecting and preventing phishing attacks more accurately. It then discusses existing phishing detection systems and their limitations. The proposed system would use a hybrid machine learning approach combining logistic regression, support vector machines, and decision trees with feature selection and hyperparameter tuning to more accurately classify phishing URLs. It concludes by discussing the potential advantages of a hybrid model for improved phishing detection performance.
PHISHING DETECTION SYSTEM THROUGH HYBRID MACHINE LEARNING BASED ON URL
ABSTRACT
• Many types of cybercrime are now carried out over the internet; this study focuses on phishing attacks.
Although the term phishing was first used in 1996, it has become the most severe and dangerous cybercrime
on the internet. Phishing uses deceptive email as its underlying mechanism for fraudulent correspondence,
followed by mock websites, to obtain the desired data from the targeted people. Various studies have presented
work on the prevention, identification, and awareness of phishing attacks; however, there is currently no complete
and proper solution for thwarting them. Machine learning therefore plays a vital role in defending against
cybercrimes involving phishing attacks. The proposed study is based on a phishing URL dataset taken from a
well-known dataset repository; it consists of phishing and legitimate URL attributes collected from 11000+
websites, in vector form. After preprocessing, several machine learning algorithms were applied to detect phishing
URLs and protect the user. This study uses machine learning models such as decision tree (DT), linear
regression (LR), random forest (RF), naive Bayes (NB), gradient boosting classifier (GBM), K-neighbors classifier (KNN),
support vector classifier (SVC), and the proposed hybrid LSD model, which is a combination of logistic regression, support
vector machine, and decision tree (LR+SVC+DT) with soft and hard voting, to defend against phishing attacks with high
accuracy and efficiency. The canopy feature selection technique, with cross-fold validation and grid search
hyperparameter optimization, is used with the proposed LSD model. To evaluate the proposed
approach, several evaluation parameters were adopted, namely precision, accuracy, recall, F1-score, and
specificity, to illustrate the effects and efficiency of the models. The results of the comparative analyses demonstrate
that the proposed approach outperforms the other models and achieves the best results.
EXISTING SYSTEM

• Phishing is the most significant issue in the field of networks and the Internet. Many researchers have
attempted to protect users from cyber-attacks by detecting phishing URLs using
machine learning, deep learning, blacklists, and whitelists. Two groups of phishing detection systems have
been proposed and implemented in previous studies: list-based and machine-learning-based phishing
identification systems. This section is divided into two parts: previous list-based and machine-learning-based
studies.
• LIST-BASED PHISHING IDENTIFICATION SYSTEM: List-based phishing identification systems use two
different lists, whitelists and blacklists, to classify legitimate and phishing
webpages. Whitelist-based phishing identification systems maintain a list of safe and reliable websites
against which a requested page is checked.
DISADVANTAGES:
1. Complexity: Hybrid models can be complex to design, implement, and maintain, requiring expertise in multiple machine learning
techniques.
2. Resource Intensive: Training and deploying hybrid models can demand more computational resources compared to single-algorithm
models.
3. Interpretability: As hybrid models involve multiple algorithms, understanding why a certain decision was made can be challenging,
impacting the interpretability of the system.
4. Training Data: Developing hybrid models often requires diverse and representative training data for each algorithm, which can be
time-consuming and require careful curation.
5. Hyperparameter Tuning: Hybrid models typically have more hyperparameters to tune, making the optimization process more intricate.
6. Overfitting: With the inclusion of multiple algorithms, there's a risk of overfitting, where the model fits the training data too closely and
performs poorly on new data.
7. Algorithm Compatibility: Integrating different algorithms into a cohesive hybrid system may be challenging due to differences in their
underlying methodologies.
8. Maintenance: Hybrid models may need continuous maintenance and updates as algorithms evolve and new phishing tactics emerge.
9. Model Complexity Trade-off: While hybrid models aim to improve accuracy, there's a trade-off between complexity and performance.
The increased complexity might not always translate into substantial gains in accuracy.
10. Implementation Challenges: Building a hybrid model requires expertise in multiple machine learning algorithms, potentially making the
development process more intricate.
PROPOSED SYSTEM

• The major contributions of this study are as follows:
• Phishing URL-based cyberattack detection is proposed in this study to prevent crime and protect people’s privacy.
• The dataset consists of 11000+ phishing URL attributes that help classify phishing URLs based on these attributes.
• Machine learning models have been applied, such as decision tree (DT), linear regression (LR), naive Bayes (NB), random
forest (RF), gradient boosting machine (GBM), support vector classifier (SVC), K-Neighbors classifier
(KNN), and the proposed hybrid LSD model (LR+SVC+DT) with soft and hard voting, which can accurately
classify the threats of phishing URLs.
• Cross-fold validation with grid search parameter tuning, based on the
canopy feature selection technique, was used with the proposed LSD hybrid model to improve
prediction results.
• The proposed methodology is evaluated using evaluation parameters such
as accuracy, precision, recall, specificity, and F1-score.
ADVANTAGES:

1. Improved Accuracy: Hybrid machine learning models combine the strengths of multiple algorithms, potentially
leading to higher accuracy in phishing detection compared to individual models.
2. Feature Extraction: Hybrid models can effectively extract and combine features from different algorithms,
enabling them to capture a wider range of characteristics that might indicate phishing.
3. Robustness: Hybrid models are often more robust against noise and variations in data, as they can balance out
the weaknesses of individual algorithms.
4. Reduced False Positives: By combining multiple algorithms, hybrid models can mitigate the tendency for false
positives, resulting in fewer legitimate URLs being misclassified as phishing.
5. Adaptability: Hybrid models can be adapted to changing phishing tactics and techniques, as different algorithms
may excel at detecting certain types of phishing attacks.
6. Enhanced Generalization: The combination of different algorithms can lead to improved generalization, allowing
the model to perform well on unseen and evolving phishing URLs.
MODULES

• Data Collection and Preprocessing:
• Feature Engineering:
• Machine Learning Algorithms:
• Individual Model Training:
• Ensemble Creation:
• Hybrid Model Training:
• Evaluation Metrics:
• Model Interpretability (Optional):
• Validation and Testing:
• Hyperparameter Tuning:
• Data Collection and Preprocessing:
  • Gather a dataset of URLs, including both legitimate and phishing URLs.
  • Extract features from URLs, such as domain, subdomain, path, length, and presence of keywords.
  • Normalize and preprocess the URL features, handling missing values and converting categorical features.
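The feature extraction step can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the feature set of the dataset used in this study: the function name extract_url_features, the keyword list, and the chosen features are all assumptions made for the example.

```python
import ipaddress
from urllib.parse import urlparse

# Hypothetical keyword list; a real system would derive this from data.
SUSPICIOUS_KEYWORDS = ("login", "verify", "secure", "update", "account")


def _is_ip(host: str) -> bool:
    """Return True if the host is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def extract_url_features(url: str) -> dict:
    """Turn one URL into a flat dict of numeric features."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    host_is_ip = _is_ip(host)
    labels = host.split(".") if host and not host_is_ip else []
    lowered = url.lower()
    return {
        "url_length": len(url),
        "host_length": len(host),
        # Labels left of the registered domain, assuming a two-label domain
        # such as example.com (this over-counts for suffixes like co.uk).
        "num_subdomains": max(len(labels) - 2, 0),
        "path_depth": len([part for part in parsed.path.split("/") if part]),
        "uses_https": int(parsed.scheme == "https"),
        "has_at_symbol": int("@" in url),
        "host_is_ip": int(host_is_ip),
        "num_keywords": sum(keyword in lowered for keyword in SUSPICIOUS_KEYWORDS),
    }


features = extract_url_features("http://secure-login.example.com/account/verify?id=1")
```

Each URL becomes one row of numbers, which is the vector form the models expect. Missing values and categorical fields would still need handling before training.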
• Feature Engineering:
  • Design and implement algorithms to extract meaningful features from URLs.
  • Create a feature representation that captures both structural and content-based characteristics.
• Machine Learning Algorithms:
  • Choose a set of machine learning algorithms with complementary strengths for phishing detection, such as decision trees, random forests, gradient boosting, support vector machines, and neural networks.
• Individual Model Training:
  • Train each selected algorithm on the preprocessed URL features.
  • Optimize hyperparameters using techniques like grid search or random search.
• Ensemble Creation:
  • Combine the predictions of individual models using techniques like majority voting, weighted averaging, or stacking.
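To make the difference between majority (hard) voting and weighted averaging (soft voting) concrete, here is a small pure-Python sketch of the two combination rules. The probabilities are made-up numbers for a single URL, and the helper names hard_vote and soft_vote are assumptions for illustration. In practice a library ensemble such as scikit-learn's VotingClassifier would normally be used instead.

```python
from collections import Counter


def hard_vote(labels):
    """Majority voting over predicted class labels.

    With an odd number of binary voters there are no ties.
    """
    return Counter(labels).most_common(1)[0][0]


def soft_vote(probabilities, weights=None):
    """Weighted average of each model's P(phishing); predict 1 if it reaches 0.5."""
    weights = weights or [1.0] * len(probabilities)
    avg = sum(w * p for w, p in zip(weights, probabilities)) / sum(weights)
    return int(avg >= 0.5), avg


# Illustrative P(phishing) for one URL from three models (made-up numbers),
# standing in for LR, SVC, and DT.
probs = [0.45, 0.40, 0.95]
labels = [int(p >= 0.5) for p in probs]  # each model's own hard decision

hard_label = hard_vote(labels)       # two of three models say "legitimate"
soft_label, avg = soft_vote(probs)   # one very confident model shifts the average
```

Here the two rules disagree: hard voting returns 0 (legitimate) because two models fall just under the threshold, while soft voting returns 1 (phishing) because the average probability is about 0.60. That sensitivity to model confidence is the practical reason to compare both.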
• Hybrid Model Training:
  • Train a higher-level model that takes the outputs of individual models as features.
  • The hybrid model learns to weigh the contributions of individual models for final prediction.
• Evaluation Metrics:
  • Implement evaluation metrics such as accuracy, precision, recall, F1-score, and ROC-AUC to assess the performance of the hybrid model.
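The label-based metrics above all follow from the four confusion-matrix counts. The sketch below computes them in plain Python, treating phishing as the positive class (label 1). The function name and the toy labels are assumptions for illustration. Specificity is included because the study reports it. ROC-AUC is left out because it needs predicted scores rather than hard labels.

```python
def classification_metrics(y_true, y_pred):
    """Compute binary classification metrics with phishing = 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)  # phishing caught
    tn = sum(t == 0 and p == 0 for t, p in pairs)  # legitimate passed
    fp = sum(t == 0 and p == 1 for t, p in pairs)  # legitimate flagged
    fn = sum(t == 1 and p == 0 for t, p in pairs)  # phishing missed

    def ratio(num, den):
        # Avoid division by zero when a class never occurs.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Toy labels: 4 phishing URLs and 6 legitimate ones.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = classification_metrics(y_true, y_pred)
```

For a phishing filter, low specificity means legitimate sites get blocked, while low recall means phishing pages slip through, so the two are worth reading together rather than relying on accuracy alone.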
• Model Interpretability (Optional):
  • Integrate methods like SHAP or LIME to provide explanations for the hybrid model's predictions.
• Validation and Testing:
  • Split the dataset into training, validation, and testing sets for model assessment.
  • Validate the hybrid model's performance on unseen data using the validation set.
• Hyperparameter Tuning:
  • Fine-tune hyperparameters of both individual models and the hybrid model to achieve optimal performance.
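One way the tuning step could look with scikit-learn is sketched below: an LR+SVC+DT voting ensemble tuned by grid search with cross-validation. This is an assumption-laden illustration, not the study's pipeline. The data is synthetic (make_classification) rather than the phishing dataset, the grid values and the 3-fold setting are arbitrary, and the canopy feature selection step is omitted.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the URL feature vectors (not the study's dataset).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# LR + SVC + DT ensemble; SVC needs probability=True so soft voting can
# average class probabilities.
lsd = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("svc", SVC(probability=True, random_state=0)),
        ("dt", DecisionTreeClassifier(random_state=0)),
    ],
    voting="soft",
)

# Illustrative grid: "<name>__<param>" reaches into each base model,
# and "voting" compares soft against hard voting.
param_grid = {
    "lr__C": [0.1, 1.0],
    "svc__C": [0.1, 1.0],
    "dt__max_depth": [3, None],
    "voting": ["soft", "hard"],
}

# 3-fold cross-validation on the training split only; the test split is
# held back for a final check.
search = GridSearchCV(lsd, param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)

best_params = search.best_params_
test_accuracy = search.score(X_test, y_test)
```

Tuning on cross-validated training folds and scoring once on the held-out split keeps the reported accuracy from being inflated by the search itself.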
CONCLUSION
• The Internet already reaches almost the whole world, and it is still growing rapidly. With the
growth of the Internet, cybercrimes that use suspicious and malicious URLs are also increasing daily, and they have
a significant impact on the quality of services provided by the Internet and industrial companies. Currently,
privacy and confidentiality are essential issues on the internet. To breach security layers and penetrate
well-protected networks, attackers use phishing emails or URLs, which are an easy and effective means of intrusion into
private or confidential networks. Phishing URLs simply masquerade as legitimate URLs. A machine-learning-based
phishing detection system is proposed in this study. A dataset consisting of 32 URL attributes and more than 11054
URLs was extracted from 11000+ websites. This dataset was taken from the Kaggle repository and used as
a benchmark for research. The dataset is already presented in the vector form used by machine
learning models. Decision tree, linear regression, random forest, support vector machine, gradient boosting
machine, K-Neighbor classifier, naive Bayes, and hybrid (LR+SVC+DT) with soft and hard voting were applied
in the experiments to achieve the highest performance results. Canopy feature selection with
cross-fold validation and grid search hyperparameter optimization is used with the LSD ensemble
model. The proposed approach was evaluated by first experimenting with each machine learning
model separately and then carrying out further evaluation. The proposed approach
achieves its aim efficiently. Future phishing detection systems should combine list-based and
machine-learning-based systems to prevent and detect phishing URLs more efficiently.