
This document discusses the use of machine learning algorithms for the classification of malicious URLs, IP addresses, and files, highlighting their significance in cybersecurity. It explores various techniques and algorithms that can identify and mitigate cyber threats, emphasizing the challenges faced in accurately classifying legitimate from malicious entities. The study also reviews existing literature and suggests improvements for enhancing the effectiveness of machine learning systems in cybersecurity applications.


IEEE - 56998

Machine Learning Based Malicious URL, IP & File Classification

2023 14th International Conference on Computing Communication and Networking Technologies (ICCCNT) | 979-8-3503-3509-5/23/$31.00 ©2023 IEEE | DOI: 10.1109/ICCCNT56998.2023.10306774

Chirag Chandrashekar, Akshat Keshav Dhamale, Praveen Joe I R
School of Computer Science & Engineering, Vellore Institute of Technology, Chennai, India
chirag.chandrashekar [email protected] [email protected] [email protected]

Abstract— Machine learning algorithms that identify and categorize cyber hazards have garnered attention recently. Malicious URLs, IP addresses, and files pose some of the greatest cybersecurity risks. This abstract describes how machine learning-based classification of dangerous URLs, IP addresses, and files identifies and mitigates threats. The article begins with hazardous URL categorization, which uses machine learning to identify whether a URL is risky depending on its attributes. URL classification is challenging since there are many legitimate URLs. Researchers have developed many machine learning algorithms that assess URL domain names, length, and keywords to identify fraudulent URLs. Second, machine learning-based IP address classification is discussed. Classifying IP addresses helps identify and combat DDoS and botnet attacks. To discover rogue IPs, machine learning techniques like clustering, classification, and regression examine IP reputation, geolocation, and traffic patterns. File categorization, which uses machine learning to identify and categorize potentially hazardous files, concludes the study. If users open harmful files, viruses, Trojans, and other malware may infect their systems and steal personal data. Using file type, size, and behavior, supervised and unsupervised machine learning systems may identify dangerous files. This study discusses machine learning methods for detecting and blocking risky websites and files. Machine learning algorithms can detect and mitigate cyber risks through analysis and categorization. These algorithms produce false positives and false negatives, overlook some dangers, and require a lot of training data. Thus, machine learning must be employed alongside intrusion detection systems and firewalls to guard against cyberattacks.

Keywords—Malicious file detection, feature extraction, Machine learning, Gradient Descent, XGBoost

I. INTRODUCTION
The use of machine learning to identify and counteract cyber risks including malicious URLs, IP addresses, and files is growing in popularity. As the repercussions of such attacks may vary from stolen data to compromised systems, they constitute a serious danger to persons, businesses, and networks. There has been a rise in the use of machine learning algorithms to increase the speed and accuracy with which threats are identified. The major objective of this work is to survey machine learning-based methods for identifying malicious URLs, IPs, and files. These are the themes that will be explored in the paper:

The challenge of categorising harmful URLs and how machine learning algorithms might be used to spot suspicious links based on their properties. The challenge of categorising IP addresses, and how to utilise machine learning techniques to detect rogue IPs. How dangerous files may be detected and classified using machine learning techniques, and the file classification issue in general. In this article, we'll take a look at the many methods, strategies, and algorithms that might be utilised to identify and neutralise potential dangers. Machine learning's benefits and drawbacks in cybersecurity will be covered, and suggestions for enhancing the efficacy of machine learning-based systems will be offered.

A. Malicious URL Classification
Uniform Resource Locators, or URLs, are the addresses used to locate resources on the World Wide Web. There is a serious cybersecurity risk posed by malicious URLs, which may be exploited to spread malware or steal sensitive information. This is where machine learning comes in; the sheer number of genuine URLs makes it difficult to tell them apart from fraudulent ones. URLs may be flagged as potentially malicious with the use of machine learning algorithms by analysing factors like domain name, length, and the presence of specified keywords. Common techniques for feature-based URL classification include decision trees, neural networks, and support vector machines (SVMs), to name a few. These algorithms may learn from a collection of annotated URLs to recognise characteristics and patterns typical of malicious URLs. Once the model is trained, it may be used to determine if a newly encountered URL poses a security risk.

B. IP Address Classification
Each computer or other device that accesses the internet has what is called an IP address, or Internet Protocol address. Malicious IPs pose a serious cybersecurity risk because they may be used in a wide variety of attacks, including distributed denial of service (DDoS) and botnet assaults. IP attributes, such as IP reputation, geolocation, and traffic patterns, may be analysed by machine learning algorithms to reveal potentially dangerous IPs. Clustering techniques, for instance, may group IPs with similar characteristics and spot outliers that point to malicious intent. IP addresses may be categorised using criteria like reputation, geolocation, and traffic patterns with the use of classification algorithms. The chance of an IP being malicious may be predicted from previous data using regression algorithms.

C. File Classification
Threats like viruses, Trojans, and other malicious software may infect users' computers and steal personal information when they open harmful files. In order to identify and categorise harmful files, machine learning algorithms may examine

14th ICCCNT IEEE Conference


Authorized licensed use limited to: ANNA UNIVERSITY. Downloaded on November 25,2024 at 06:33:30 UTC from IEEE Xplore. Restrictions apply.
July 6-8, 2023
IIT - Delhi, Delhi, India

information such as file type, size, and behaviour. For instance, harmful file patterns and characteristics may be identified using supervised learning algorithms by studying a labelled dataset. In order to spot abnormalities and identify possibly dangerous files, unsupervised learning systems may examine file behaviour. These algorithms may also be used to categorise files according to characteristics like format, size, and behaviour.

II. LITERATURE REVIEW
A machine learning-based strategy for identifying malicious URLs is proposed in the study [1]. Cybercriminals often use malicious URLs in their scamming, virus spreading, and spamming endeavours. In order to train machine learning models like Support Vector Machines (SVM) and Random Forests (RF), the suggested technique utilises lexical data collected from URLs. These features include things like length, entropy, and character frequencies. Experiments on a dataset of URLs demonstrate that the suggested technique successfully detects malicious URLs with a high F1 score (99.3% for SVM and 98.9% for RF). The research found that the suggested technique was useful in identifying dangerous URLs in real-world circumstances, suggesting that it may be utilised as part of a larger security system to safeguard consumers against cyber threats.

An online harmful URL and DNS detection strategy using deep learning is proposed in the article [2]. The suggested method employs a multi-layered deep neural network to analyse the characteristics of URLs and DNS queries, such as domain length, entropy, and frequency of characters, and then assigns a threat score to each. In order to train and assess the proposed method, the authors additionally create a dataset including more than 12,000 URLs and DNS queries. Do et al. present a machine learning-based method for identifying malicious URLs in their research article [3]. The URL's length, age, character frequencies, and entropy are only some of the characteristics extracted using the suggested technique of feature extraction. The authors also assess several machine learning techniques for determining whether or not a URL is malicious, such as decision trees, K-nearest neighbors, and random forests.

In the study [4], a method for identifying malicious URLs that is based on machine learning is presented. Features such as domain age, length, and character frequencies are extracted from the URL using the suggested technique. The authors also assess several machine learning techniques for determining whether or not a URL is malicious, such as decision trees, K-nearest neighbors, support vector machines, and random forests. An empirical investigation of the efficacy of machine learning algorithms for harmful URL identification is presented in the publication [5]. The authors use a number of machine learning methods (decision trees, K-nearest neighbors, support vector machines, and random forests) to categorise URLs as harmful or benign by extracting data such as length, domain age, and character frequencies from the URL.

Study [6] offers a deep learning-based method for identifying malicious URLs. The authors use a convolutional neural network (CNN) to learn a compact representation of the URL, which is then used to determine whether or not the URL is dangerous. The authors additionally test their model, URLNet, on two datasets (Phishtank and Alexa) and assess its effectiveness.

Research [7] suggests using such methods to categorise malicious URLs. On a dataset consisting of 10,000 malicious and 10,000 benign URLs, the authors test the efficacy of several deep learning models, including a convolutional neural network (CNN), a long short-term memory (LSTM) network, and a hybrid model that combines CNN and LSTM. Using deep learning models, the study [8] proposes a new method for doing so. Convolutional Neural Networks (CNNs), Long Short-Term Memory (LSTM) networks, and Attention Mechanisms are only some of the deep learning methods proposed by the authors in their three-stage categorization procedure. The first step involves taking raw network traffic data and running it through a CNN-based model to pull out features. In the second phase, a long short-term memory (LSTM) model learns the correlations and timings between packets in the network. The study [9] details a deep learning-based strategy for identifying malware traffic on networked computers without requiring specialised expertise. In order to extract characteristics from raw network traffic data and categorise it as benign or malicious, the authors suggest a system dubbed DeepInTheDark, which blends Convolutional Neural Networks (CNN) with Long Short-Term Memory (LSTM) models.

The research article [10] summarises the state-of-the-art methods for classifying Internet traffic using machine learning. The authors begin with a discussion of the difficulties and significance of Internet traffic classification before providing an overview of the various traffic classification strategies, including port-based, payload-based, and behaviour-based methods. A unique method for identifying malicious URLs using deep learning models is proposed in the research article [11]. The authors offer DURLD, a system that employs Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) models to learn character-level representations of URLs and label them as harmful or safe. The authors test their methodology on a massive dataset including over 1.8 million URLs, some of which are dangerous and others of which are safe to visit. The experimental findings show that DURLD exceeds the current state-of-the-art methods and achieves detection accuracies of up to 99.7%.

A machine learning-based strategy for categorising assaults in IDSs is proposed in the research article [12]. Using variables such as protocol type, source IP address, and destination IP address, the authors offer a system that employs a Random Forest (RF) machine learning method to categorise assaults. The NSL-KDD dataset is a gold standard for testing intrusion detection systems, and the authors utilise it to assess their suggested method. The testing findings show that their suggested method achieves up to 99% accuracy in classification, which is much better than the state-of-the-art methods. Characteristics of various assaults are revealed via the paper's in-depth investigation of the most relevant criteria for categorization. DoS, R2L, and U2R assaults are only a few of the examples on which the authors demonstrate the effectiveness of their suggested technique. The authors also address the shortcomings of their strategy and provide suggestions for further enhancing intrusion detection systems using machine learning. In conclusion, the study offers a viable method for improving the security of computer systems and networks by categorising various forms of assault in intrusion detection


systems through the use of machine learning techniques. The method may be used for additional purposes, such as cyber threat intelligence and network traffic analysis.

For identifying malicious domain names, the study [13] presents an extreme machine learning (EML) method. Using characteristics including length, entropy, and character n-grams, the authors offer a system that uses the EML algorithm to label domain names as harmful or safe. Over a hundred thousand domain names, both malicious and benign, are used to assess the authors' suggested technique. The experimental findings show that their method beats other state-of-the-art methods and achieves a high detection accuracy of up to 99.5%. Insights into the characteristics of malicious domain names are provided by the paper's extensive investigation of the variables that are most relevant for categorization. The authors demonstrate the effectiveness of their suggested method in identifying a wide range of bad domain names, including those used for phishing, malware hosting, and botnet command and control. The authors also address the weaknesses of their solution and provide suggestions on how to further EML-based domain name detection in the future. Overall, the study proposes a potential method for identifying harmful domain names utilising extreme machine learning methods, which may contribute to better network and system security. The method may be used for a wide variety of purposes, including spam filtering and online content screening.

An IP reputation-based machine learning technique for malware detection is proposed in the research article [14]. The authors offer a system that can identify malicious IP addresses in a network by analysing traffic data and using machine learning methods. A dataset including more than 10,000 IP addresses, both malicious and benign, is used to assess the authors' suggested technique. The experimental findings show that their method beats other state-of-the-art methods and achieves high detection accuracy of up to 98%. In addition to discussing the characteristics of malicious IP addresses, the study gives a comprehensive analysis of the criteria that are most relevant for categorization. The authors demonstrate that their suggested method is capable of identifying several forms of malware, including Trojans, viruses, and worms. The authors also address the shortcomings of their solution and provide suggestions on how to further machine learning-based malware detection in the future. Using IP reputation and machine learning techniques to identify malware is a potential strategy presented in this research study that may improve the safety of online infrastructure. The method may be used for a wide variety of purposes, including intrusion detection and threat intelligence.

Research [15] provides an in-depth analysis of the several methods that use machine learning to identify malicious code in executable files. This study surveys the present state of the art in malware detection, discussing the many forms of malware, their distinguishing features, and the obstacles that must be overcome. Traditional machine learning methods, deep learning techniques, and hybrid approaches are all discussed, as are their respective merits in the context of malware detection. An innovative method for identifying malicious files using static analysis and machine learning approaches is presented in [16]. The authors present Hidost, a machine learning-based system that can statically analyse the properties of executable files to determine whether or not they are harmful. To classify executable files, the system applies a machine learning algorithm to a collection of characteristics extracted from the files. The technique of feature extraction and the machine learning algorithm used by Hidost are both described in depth in the article.

In [17], the authors introduce Opem, a novel malware detection method that combines static and dynamic analysis. Machine learning methods are used to analyse both static data (like headers and import tables) and dynamic aspects (like system calls and memory access patterns) of executable files. As explained by the authors, Opem's architecture is broken down into three primary parts: a static analyzer, a dynamic analyzer, and a machine learning module. The executable files are analysed by the static analyzer, which extracts a set of characteristics, and by the dynamic analyzer, which tracks the file's execution and records its behaviour. The machine learning component then makes use of the aggregated collection of characteristics to determine if the file in question is harmful or not. Traditional machine learning algorithms like decision trees, support vector machines, and neural networks are discussed, as are more recent deep learning techniques like convolutional neural networks and recurrent neural networks for malware detection. The author of [18] compares the effectiveness of several methods on a variety of datasets and analyses the benefits and drawbacks of each. Also discussed are the static analysis features, dynamic analysis features, and hybrid characteristics utilised in malware detection using machine learning. The author elaborates on how to achieve high accuracy in malware detection via feature selection, feature engineering, and the use of machine learning algorithms. Finally, the study discusses the remaining research questions and makes suggestions for future studies in the area of machine learning-based malware detection. Researchers, practitioners, and policymakers in the area of cybersecurity may find this study's thorough analysis of the present state of the art in machine learning-based malware detection and categorization helpful.

The authors of [19] provide a method for malware classification that combines convolutional neural networks (CNNs) with decision trees. The system automatically pulls characteristics from files to serve as inputs to the CNN, and it is trained on a huge dataset of malware samples and benign files. A decision tree is fed the CNN's output to determine whether or not the file is harmful. The system architecture, feature extraction procedure, and training and testing technique employed in the research are all described in depth in the publication. The system is tested on many datasets, and its performance is compared to that of other cutting-edge, machine learning-based malware detection systems. According to the findings, the suggested system is able to classify malware with a high degree of precision, even for users who are not specialists in the field.

The authors of [20] suggest a system that combines a mix of static and dynamic analytic features to extract data from malware samples, and then uses a feature selection procedure to determine which characteristics are most useful for the classification job. To classify the data, the chosen characteristics are fed into a number of machine learning techniques such as


support vector machines (SVM), decision trees, and random forests. The report gives a comprehensive breakdown of the study's training and testing procedures, feature extraction and selection methods, and system architecture. The system is tested on many datasets, and its performance is compared to that of other cutting-edge, machine learning-based malware detection systems. The authors of [21] present a technique that uses several machine learning classifiers in parallel to speed up the detection process while maintaining high accuracy. In order to train the machine learning classifiers, the system pulls features from Android apps. The system uses both static and dynamic analytic tools to detect and prevent malware. The system architecture, feature extraction procedure, and training and testing technique employed in the research are all described in depth in the publication.

In order to extract useful data from Android apps, the authors of [22] present a method that uses both static and dynamic analytic features, followed by a feature selection procedure to determine which characteristics are most useful for the classification job at hand. To classify the data, the chosen characteristics are fed into a number of machine learning techniques such as support vector machines (SVM), decision trees, and random forests. The report gives a comprehensive breakdown of the study's training and testing procedures, feature extraction and selection methods, and system architecture. The system is compared to other cutting-edge, machine learning-based malware detection systems. The authors of [23] suggest a method that detects malware by looking at the order in which Android apps make system calls and gleaning useful information from it. Machine learning methods such as support vector machines (SVM), decision trees, and random forests use the extracted characteristics as input for categorization.

III. METHODOLOGY

The working of the proposed model is explained for three classes of input (URL, IP and File), along with a description of its user interface.

A. URL
In order to detect whether the provided URL is malicious or not, the proposed system uses a gradient boosting technique which is already trained on a Kaggle dataset. URL detection using this trained model involves multiple steps, some of which are as follows:

• Data Collection: We collected a large dataset of URLs from Kaggle that includes both legitimate and malicious URLs. Such data could also have been obtained by web scraping, from public datasets, or from third-party sources.

• Data Preprocessing: Once data is collected, it is preprocessed by cleaning it and removing any duplicate or irrelevant URLs. Features that can be used for machine learning are then extracted from the URLs.

• Feature Extraction: Extracting features from the URLs is a critical step in developing a machine learning model for URL detection. Some of the features we extracted include domain age, length of the URL, presence of special characters, and presence of certain keywords.

• Data Splitting: Split the preprocessed data into training and testing sets. The training set is used to train the Gradient Boosting model, and the testing set is used to evaluate the model's performance.

• Model Training: Train the Gradient Boosting model on the training set. Hyperparameters are tuned to optimize performance; suitable values can be found using grid search or random search.

• Model Evaluation: Evaluate the performance of the model on the testing set using metrics like accuracy, precision, recall, F1-score, and the ROC curve. Cross-validation can also be used to validate the model's performance.

• Model Deployment: Once satisfied with the model's performance, we can deploy it in a production environment where it can be used to detect malicious URLs in real time.

• Model Monitoring: Monitor the model's performance in production to ensure that it is still accurate and effective. The model may need to be retrained periodically to keep it up to date with the latest threats.

B. IP
Similar to URL detection, in order to detect whether the provided IP address is malicious or not, the proposed model uses a machine learning technique. Here the XGBoost algorithm is used and trained on a dataset. IP detection first involves IP extraction, where the provided IP address is broken down into its 4 numbers so that the trained model can predict the class to which the given IP belongs. Once the IP features are extracted, they are fed to an XGBoost model. Its working can be described as follows:

• Ensemble Method: XGBoost is an ensemble method that combines multiple weak learners to form a strong learner. Specifically, it creates an ensemble of decision trees, where each tree tries to correct the mistakes made by the previous trees.

• Boosting: XGBoost uses a boosting technique to improve the accuracy of the model. In boosting, the model learns iteratively by adding new trees to the ensemble, where each new tree is trained on the residual errors of the current ensemble. This process continues until the desired level of accuracy is achieved.

• Gradient Descent: Like other gradient boosting methods, XGBoost relies on gradient-based optimization. Each new tree is fitted to the gradient of the loss function with respect to the current predictions (XGBoost also uses second-order gradient information), so that adding trees iteratively reduces the loss.

• Regularization: XGBoost uses regularization techniques to prevent overfitting, which is a common problem in machine learning. Regularization techniques such as L1 and L2 regularization are used to add a penalty term to the loss function, which helps to simplify the model and reduce overfitting.

• Tree Pruning: XGBoost uses tree pruning techniques to improve the efficiency of the model. Tree pruning involves removing nodes and branches from the decision trees that do not contribute significantly to the accuracy of the model. This helps to reduce the complexity of the model and improve its efficiency.

• Feature Importance: XGBoost provides a feature importance score that indicates the relative importance of each feature in predicting the target variable. This can be useful for understanding which features are most important for the model and for feature selection.

The results obtained from the trained model are compared with the various scans provided by the VirusTotal API.
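The boosting idea described above can be illustrated with a minimal sketch. It assumes that the four octets of the IP address are used directly as features, as the text suggests, and it stands in for XGBoost with a hand-written booster of one-split trees (stumps) fitted to residuals under squared loss, where the residual equals the negative gradient. XGBoost's L1/L2 penalties, second-order gradients, and pruning are deliberately left out. The function names and the toy labels on reserved documentation addresses are hypothetical and are not taken from the authors' implementation or dataset.

```python
import ipaddress


def ip_to_features(ip):
    """Split a dotted-quad IPv4 address into its four octets."""
    return list(ipaddress.IPv4Address(ip).packed)


def fit_stump(X, residuals):
    """Return (feature, threshold, left_mean, right_mean) with least squared error."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            if not left or not right:
                continue  # a split must put samples on both sides
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or err < best[0]:
                best = (err, j, t, lm, rm)
    return best[1:] if best else None


def stump_predict(stump, row):
    j, t, lm, rm = stump
    return lm if row[j] <= t else rm


def fit_boosted_stumps(X, y, n_rounds=20, learning_rate=0.5):
    """Each round fits a stump to the residuals of the current ensemble."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(X, residuals)
        if stump is None:
            break
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, row)
                 for p, row in zip(preds, X)]
    return base, stumps, learning_rate


def predict_score(model, row):
    """Score near 1 suggests the 'malicious' label, near 0 the 'benign' label."""
    base, stumps, lr = model
    return base + lr * sum(stump_predict(s, row) for s in stumps)


# Toy, made-up labels on RFC 5737 documentation addresses (1 = malicious).
train = [("192.0.2.10", 0), ("192.0.2.77", 0), ("198.51.100.5", 0),
         ("203.0.113.9", 1), ("203.0.113.200", 1), ("203.0.113.54", 1)]
X = [ip_to_features(ip) for ip, _ in train]
y = [label for _, label in train]
model = fit_boosted_stumps(X, y)
print(predict_score(model, ip_to_features("203.0.113.77")))
```

Raw octets carry little semantic information on their own, so in practice a model of this kind would likely benefit from the reputation, geolocation, and traffic features mentioned in Section I-B.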
C. File
For detecting whether a file is malicious or not, the proposed system uses an API from VirusTotal, which works on the following principles:

• VirusTotal File Upload: The first step is to upload the file to VirusTotal. VirusTotal will analyse the file automatically using various antivirus engines and other malware detection technologies.

• Analyze Findings: After analysing the file, VirusTotal will provide a report that includes the detection findings from each antivirus engine and additional detection technologies. Based on the detection findings, the report will tell if the file is malicious or not.

• Examine the Detection Ratio: This is the number of antivirus engines and detection technologies that identify the file as harmful, relative to the total number that scanned it. A high detection ratio suggests that the file is harmful, whereas a low detection ratio suggests that the file is likely benign.

VirusTotal gives extra information about the file, such as its behavior, file type, and file size, in addition to the detection findings. This data can help determine the nature of the file and its possible impact.

• Determine Action: The next step is to decide what action to take based on the detection findings and additional information. If the file is detected as malicious by a high number of antivirus engines, it is likely that the file is indeed malicious and should be deleted or quarantined. However, if the detection ratio is low or there are conflicting results, further investigation may be needed before taking action.

• Monitor: Even after taking action, it is important to continue monitoring the system for any signs of malicious activity. VirusTotal may be used to continuously monitor the file or URL to check if fresh detection results or further information become available.

D. Model Architecture

Fig. 1. Proposed System Flow diagram

Fig. 2. Methodological Pipeline

The whole URL, IP and File detection model is implemented in Python, using tkinter to create a GUI. The end user can navigate between the URL, IP and File detection tabs. Once input is given, the user gets basic statistics about the URL, IP or File, and a “More details” button appears at the bottom. Once clicked, it opens a dashboard which provides more details about the given input. The dashboard is built with Flask, a Python-based web framework, and React is used to implement the frontend interface, which shows the following details:

• Trend over last year: A line plot showing similar file signatures scanned over the last year. Values show scans in hundreds on VirusTotal.

• Countries reporting similar malware: A line plot showing country-of-origin trends of similar file signatures on VirusTotal.

• Antivirus scan for the same category: A bar plot showing similar file signatures detected on different antivirus platforms over the last year, in hundreds.


• Efficiency: Efficiency of the tool for File, URL and IP scans, updated upon scanning of each input.
• Antivirus Details: Detailed table showing all antivirus scans along with their response details and malicious indications.
E. Algorithm
Algorithm: Malicious URL, IP or File Classification
Input: URL, IP or File to check whether it is malicious or not.
Output: Mapping the given user input to the result obtained from the proposed system with a detailed analysis.
1. BEGIN
2. Select the type of input, that is, whether the input is a URL, IP, or file
3. If (input == URL)
   3.1. Enter the URL
   3.2. URL segmentation is performed in order to segment the URL
   3.3. This segmentation is uploaded to a pre-trained model, which is the gradient boost algorithm
   3.4. The result obtained from the model is displayed
   3.5. Analysis is performed by passing the same URL to multiple antiviruses using the VirusTotal API, and the results obtained from the scan are compared with the proposed model
   3.6. End if
4. If (input == IP address)
   4.1. Enter the IP address
   4.2. IP segmentation is performed in order to segment the provided IP address
   4.3. This segmentation is uploaded to a pre-trained model, which is the XGBoost algorithm
   4.4. The result obtained from the model is displayed
   4.5. Analysis is performed by passing the same IP address to multiple antiviruses using the VirusTotal API, and the results obtained from the scan are compared with the proposed model
   4.6. End if
5. If (input == File)
   5.1. Enter the file location where the file is present
   5.2. The file present in the uploaded location is scanned and sent to the main server using the VirusTotal API
   5.3. In the server, the file is checked for various characteristics like file reputation, similarity, and many other features
   5.4. The result obtained from the model is displayed
   5.5. Analysis is performed by passing the same file to multiple antiviruses using the VirusTotal API, and the results obtained from the scan are compared with the proposed model
   5.6. End if
6. Display all the statistical data obtained from the various scans and the proposed system
7. END

IV. RESULT
A. Classifying Malicious URL
The program is initiated with a tkinter GUI where the user is given tabs to select between URL, IP and File Detection. Once the user clicks on URL Detection, the window shown in Fig. 3 opens.
Fig. 3. Tkinter GUI of URL Detection
Once input is given in the URL field box, the user can click on Check. After the scan completes, it shows basic results along with a "More details" tab. Once clicked, it opens the Dashboard (Fig. 4-5, 7-8).
Fig. 4. Malware detection rate per month of a malicious URL
Fig. 5. Similar sample detection rate on different antivirus of a malicious URL
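As a minimal sketch of the algorithm in Section E (not the authors' implementation), the following Python fragment shows the input-type selection of step 2, a simple URL segmentation for step 3.2, and the comparison against antivirus verdicts in steps 3.5, 4.5 and 5.5. The function names, the tokenisation rule and the 0.5 threshold are assumptions made for illustration; the engine verdicts are those listed in Table 2. No VirusTotal request is made and no trained model is loaded.

```python
import ipaddress
import re
from urllib.parse import urlparse


def input_kind(value: str) -> str:
    """Step 2: decide whether the input is an IP address, a URL, or a file path."""
    try:
        ipaddress.ip_address(value)
        return "ip"
    except ValueError:
        pass
    parsed = urlparse(value)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "url"
    return "file"


def segment_url(url: str) -> dict:
    """Step 3.2 (assumed rule): split a URL into host labels and path/query tokens."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    rest = parsed.path + "?" + parsed.query
    tokens = [t for t in re.split(r"[^A-Za-z0-9]+", rest) if t]
    return {"host_labels": host.split("."), "tokens": tokens, "length": len(url)}


def detection_ratio(engine_verdicts: dict) -> float:
    """Fraction of antivirus engines that flag the input as malicious."""
    if not engine_verdicts:
        return 0.0
    return sum(engine_verdicts.values()) / len(engine_verdicts)


def agrees_with_engines(model_says_malicious: bool,
                        engine_verdicts: dict,
                        threshold: float = 0.5) -> bool:
    """Steps 3.5/4.5/5.5: compare the model verdict with the engine majority.

    The threshold is a hypothetical cut-off, not a value from the paper.
    """
    engines_say_malicious = detection_ratio(engine_verdicts) >= threshold
    return model_says_malicious == engines_say_malicious


# Per-engine "detected" values taken from Table 2.
TABLE_2_VERDICTS = {
    "Lionic": True, "McAfee": True, "Arcabit": True, "BitDefender": True,
    "Trapmine": False, "MAX": True, "Emsisoft": True, "Avast": True,
}

if __name__ == "__main__":
    print(input_kind("http://example.com/login.php?id=1"))  # url
    print(detection_ratio(TABLE_2_VERDICTS))                # 0.875
```

With the Table 2 verdicts, seven of the eight engines flag the file, so a model that also labels it malicious agrees with the engines under the assumed threshold.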


B. Classifying Malicious IP
The user can click on the IP Detection tab in the tkinter GUI to perform an IP check (Fig. 5).
Fig. 6. Tkinter GUI of IP Detection
After the basic details of the IP are shown on clicking Check, the user is presented with a button that opens a detailed dashboard analysis of that IP (Fig. 6, 7).
Fig. 7. Malware detection rates of six countries of a malicious URL
Fig. 8. Dashboard for IP Detection

Antivirus             detected   result       Detail
Yandex safebrowsing   False      Clean site   http://yandex.com/infected?l10n=en&url=http://www.google.com/
Phishtank             False      Clean site   -
OpenPhish             False      Clean site   -
Snort                 False      Clean site   -
Bkav                  False      Clean site   -
Table 1 : Tabular response results of benign URL

C. Classifying Malicious Files
The user also has the option to upload a file and scan it for any malicious indications in the File Detection tab. Once uploaded, it will show rudimentary details (Fig. 8).

Antivirus     detected   result                      version
Lionic        True       Trojan.Win32.Tedy.4!c       7.5
McAfee        True       Artemis!C76926FE2E63        6.0
Arcabit       True       Trojan.Cerbu.D28E12         0.0.18
BitDefender   True       Gen:Variant.Cerbu.167442    7.2
Trapmine      False      None                        4.14
MAX           True       malware (ai score=83)       1.1.14
Emsisoft      True       Gen:Variant.Cerbu.167442    6.0
Avast         True       Win64:PWSX-gen [Trj]        11.7
Table 2 : Tabular responsive results of malicious file

Once the user clicks on the “More details” button, a dashboard will appear which shows a detailed analysis of that file (Fig. 9).
Fig. 9. Tkinter GUI of File Detection

The Dashboard provides detailed data about the given input. These details help an organization or individual assess the maliciousness of the given file, IP, or URL. The trend plots help determine the likelihood of further spread and the infection rate. Moreover, the tabular details obtained from the different antivirus engines show the detection result, the name of the malicious indication, and its details. The more antivirus engines flag the input, the higher its maliciousness level and the more dangerous it potentially is.

V. CONCLUSION
To sum up, machine learning algorithms have proven to be an effective tool in detecting and mitigating cybersecurity risks, especially in the areas of recognising and categorising harmful URLs, IPs, and files. There is a need for an efficient and reliable detection system, since these threats represent serious dangers to persons and businesses alike. The complex and ever-changing nature of cybersecurity is well suited to the pattern-finding and data-analysis capabilities of machine learning algorithms. Successful classification of malicious URLs based on factors such as domain name, length, and the inclusion of specified keywords has been achieved using algorithms such as decision trees, neural networks, and support vector machines. In a similar vein, hostile IPs may be identified and classified by machine learning algorithms by analysing their reputation, geolocation, and traffic patterns. In order to improve the accuracy and speed of threat detection, algorithms for clustering, classification, and regression are often employed to find patterns and traits that are suggestive of harmful behaviour.


Threats such as malware, which may infect users' computers and steal personal data, are often hidden inside malicious files. To identify and categorise harmful files, machine learning algorithms may examine file attributes such as file type, size, and behaviour. These algorithms may detect malicious files more accurately since they have learned to recognise patterns and attributes characteristic of such files from a labelled dataset. The benefits of machine learning algorithms must be balanced against their drawbacks, such as the need for huge quantities of training data and the possibility of false positives and false negatives. Future progress in threat identification and mitigation will depend on addressing these constraints and refining machine learning-based techniques.

In conclusion, harmful URL, IP, and file categorization using machine learning has the potential to improve detection accuracy and response time. Despite these caveats, machine learning is becoming an increasingly useful weapon in the battle against cybersecurity threats as new algorithms and methods are developed and refined.
