Paper 19-Malicious URL Detection Based On Machine Learning
Abstract—Currently, the risk of network information insecurity is increasing rapidly in both number and level of danger. The methods most commonly used by hackers today are to attack end-to-end technology and to exploit human vulnerabilities. These techniques include social engineering, phishing, pharming, etc. One of the steps in conducting these attacks is to deceive users with malicious Uniform Resource Locators (URLs). As a result, malicious URL detection is of great interest nowadays. Several scientific studies have presented methods to detect malicious URLs based on machine learning and deep learning techniques. In this paper, we propose a malicious URL detection method using machine learning techniques based on our proposed URL behaviors and attributes. Moreover, big data technology is also exploited to improve the capability of detecting malicious URLs based on abnormal behaviors. In short, the proposed detection system consists of a new set of URL features and behaviors, a machine learning algorithm, and a big data technology. The experimental results show that the proposed URL attributes and behaviors can significantly improve the ability to detect malicious URLs. This suggests that the proposed system may be considered an optimized and user-friendly solution for malicious URL detection.

Keywords—URL; malicious URL detection; feature extraction; feature selection; machine learning

I. INTRODUCTION

A Uniform Resource Locator (URL) is used to refer to resources on the Internet. In [1], Sahoo et al. presented the characteristics and the two basic components of a URL: the protocol identifier, which indicates what protocol to use, and the resource name, which specifies the IP address or the domain name where the resource is located. Each URL therefore has a specific structure and format. Attackers often try to change one or more components of the URL's structure to deceive users and spread their malicious URLs. Malicious URLs are links that adversely affect users. These URLs redirect users to resources or pages on which attackers can execute code on users' computers, or redirect users to unwanted sites, malicious websites, phishing sites, or malware downloads. Malicious URLs can also be hidden in download links that are deemed safe, and can spread quickly through file and message sharing in shared networks. Some attack techniques that use malicious URLs include [2, 3, 4]: Drive-by Download, Phishing and Social Engineering, and Spam.

According to statistics presented in [5], in 2019, attacks that spread malicious URLs ranked first among the 10 most common attack techniques. In particular, according to these statistics, the three main URL spreading techniques, which are malicious URLs, botnet URLs, and phishing URLs, are increasing in both number of attacks and danger level.

Given the increase in the number of malicious URL distributions over consecutive years, there is a clear need to study and apply techniques or methods to detect and prevent these malicious URLs.

Regarding the problem of detecting malicious URLs, there are currently two main trends: malicious URL detection based on signs or sets of rules, and malicious URL detection based on behavior analysis techniques [1, 2]. The method based on a set of markers or rules can quickly and accurately detect malicious URLs. However, it is not capable of detecting new malicious URLs that are not in the set of predefined signs or rules. The method based on behavior analysis techniques adopts machine learning or deep learning algorithms to classify URLs based on their behaviors. In this paper, machine learning algorithms are utilized to classify URLs based on their attributes. The paper also includes a new URL attribute extraction method.

In our research, machine learning algorithms are used to classify URLs based on the features and behaviors of URLs. The features are extracted from static and dynamic behaviors of URLs and are new to the literature. These newly proposed features are the main contribution of the research. Machine learning algorithms are one part of the whole malicious URL detection system. Two supervised machine learning algorithms are used: Support Vector Machine (SVM) and Random Forest (RF).

The paper is organized as follows. Section II reviews some recent works in the literature on malicious URL detection. The proposed malicious URL detection system using machine learning is presented in Section III, where the new features for the URL detection process are also described in detail. Experimental results and discussions are provided in Section IV. Section V concludes the paper.

II. RELATED WORKS

A. Signature based Malicious URL Detection

Studies on malicious URL detection using signature sets were investigated and applied long ago [6, 7, 8]. Most of these studies use lists of known malicious URLs. Whenever a new URL is accessed, a database query is
www.ijacsa.thesai.org
(IJACSA) International Journal of Advanced Computer Science and Applications,
Vol. 11, No. 1, 2020
executed. If the URL is blacklisted, it is considered malicious and a warning is generated; otherwise, the URL is considered safe. The main disadvantage of this approach is that it is very difficult to detect new malicious URLs that are not in the given list.

B. Machine Learning based Malicious URL Detection

Three types of machine learning algorithms can be applied to malicious URL detection: supervised learning, unsupervised learning, and semi-supervised learning. The detection methods are based on URL behaviors.

In [1], a number of malicious URL detection systems based on machine learning algorithms were investigated. Those machine learning algorithms include SVM, Logistic Regression, Naïve Bayes, Decision Trees, Ensembles, Online Learning, etc. In this paper, two algorithms, RF and SVM, are used. The accuracy of these two algorithms with different parameter setups is presented in the experimental results.

The behaviors and characteristics of URLs can be divided into two main groups, static and dynamic. In [9, 10, 11], the authors presented methods of analyzing and extracting static behaviors of URLs, including Lexical, Content, Host, and Popularity-based features. The machine learning algorithms used in these studies are Online Learning algorithms and SVM. Malicious URL detection using dynamic actions of URLs is presented in [12, 13]. In this paper, URL attributes are extracted based on both static and dynamic behaviors. Several attribute groups are investigated: the Character and semantic group; the Abnormal group in websites and the Host-based group; and the Correlated group.

C. Malicious URL Detection Tools

URL Void: URL Void is a URL checking program that uses multiple engines and domain blacklists. Examples of the engines it uses are Google SafeBrowsing, Norton SafeWeb, and MyWOT. The advantage of the URL Void tool is its compatibility with many different browsers, as well as its support for many other testing services. The main disadvantage of the URL Void tool is that the malicious URL detection process relies heavily on a given set of signatures.

UnMask Parasites: UnMask Parasites is a URL testing tool that downloads the provided links and parses their Hypertext Markup Language (HTML) code, especially external links, iframes, and JavaScript. The advantage of this tool is that it can detect iframes quickly and accurately. However, this tool is only useful if the user already suspects that something strange is happening on their site.

Dr.Web Anti-Virus Link Checker: Dr.Web Anti-Virus Link Checker is an add-on for Chrome, Firefox, Opera, and IE that automatically finds and scans for malicious content in download links on social networking sites such as Facebook, Vk.com, and Google+.

Comodo Site Inspector: This is a malware and security hole detection tool. It helps users check URLs, or enables webmasters to set up daily checks by downloading all the specified sites and running them in a sandboxed browser environment.

Some other tools: Besides the typical tools mentioned above, there are other URL checking tools, such as UnShorten.it, VirusTotal, Norton Safe Web, SiteAdvisor (by McAfee), Sucuri, Browser Defender, Online Link Scan, and Google Safe Browsing Diagnostic.

From the analysis and evaluation of the malicious URL detection tools presented above, it is found that the majority of current malicious URL detection tools are signature-based URL detection systems. Therefore, the effectiveness of these tools is limited.

III. MALICIOUS URL DETECTION USING MACHINE LEARNING

A. The Model

Fig. 1 presents the proposed malicious URL detection system using machine learning. The model contains two stages: training and detection.

Training stage: To detect malicious URLs, it is necessary to collect both malicious URLs and clean URLs. All the malicious and clean URLs are then correctly labeled and passed to attribute extraction. These attributes are the basis for determining which URLs are clean and which are malicious; they are presented in detail in this paper. Finally, the dataset is divided into two subsets: training data used for training the machine learning algorithms, and testing data used for the testing process. If the classification performance of the machine learning model is good (high classification accuracy), the model is used in the detection stage.

Detection stage: The detection stage is performed on each input URL. First, the URL goes through the attribute extraction process. Next, these attributes are input to the classifier, which classifies the URL as clean or malicious.

B. URL Attribute Extraction and Selection

In [1], the authors listed the following main attribute groups for malicious URL detection.

Lexical features: these features include URL length, main domain length, maximum domain token length, average path length, and average token length in the domain.

Host-based features: these features are extracted from the host characteristics of the URLs. These attributes indicate the location of malicious servers, the identity of malicious servers, and the degree to which several host-based features contribute to the URL's malicious level.

Content-based features: these features are acquired when a whole web page is downloaded. The workload of these features is quite heavy, since a lot of information needs to be extracted, and there may be security concerns about accessing that URL. However, with more information available about a particular
site, it is expected to create a better prediction model. The content-based features of a website can be extracted primarily from its HTML content and its use of JavaScript.

The above are the three main attribute groups commonly used by researchers to detect malicious URLs. However, each study makes its own decision on suitable attributes and characteristics for each particular experimental dataset. In this paper, the use of all three attribute groups is recommended. In addition, within each attribute group, some new attributes and characteristics of the URL are proposed to optimize the ability to detect malicious URLs. The new attributes for malicious URL detection in this research are listed in Tables I, II, and III.
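As an illustration of the lexical group listed in [1] (URL length, main domain length, maximum domain token length, average path length, average token length in the domain), the following minimal Python sketch computes such attributes from a URL string. It is not the authors' extraction code: the function name `lexical_features`, the choice of token delimiters, the treatment of the host name as the "main domain", and the example URL are all assumptions made for illustration.

```python
import re
from urllib.parse import urlparse


def lexical_features(url: str) -> dict:
    """Compute a few lexical attributes of a URL (illustrative definitions)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""  # host name, lower-cased by urlparse

    # Assumption: domain tokens are separated by '.', '-' or '_';
    # path tokens are separated by '/', '.', '-', '_', '=', '?' or '&'.
    domain_tokens = [t for t in re.split(r"[.\-_]", host) if t]
    path_tokens = [t for t in re.split(r"[/.\-_=?&]", parsed.path) if t]

    def mean_len(tokens):
        # Average token length, or 0.0 when there are no tokens.
        return sum(len(t) for t in tokens) / len(tokens) if tokens else 0.0

    return {
        "url_length": len(url),
        "domain_length": len(host),
        "max_domain_token_length": max((len(t) for t in domain_tokens), default=0),
        "avg_domain_token_length": mean_len(domain_tokens),
        "avg_path_token_length": mean_len(path_tokens),
    }


# Hypothetical URL, used only to exercise the function.
features = lexical_features("http://login.example-bank.com/secure/update.php")
```

Each URL is turned into a fixed-length numeric record, which is the form a classifier such as RF or SVM expects as input.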
Fig. 1. Proposed malicious URL detection model. Training stage: URL, machine learning algorithm, training. Detection stage: URL, classification, safe URL or malicious URL.
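The dataset split performed at the end of the training stage in Section III-A can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `split_dataset`, the 80/20 ratio, the fixed seed, and the example URLs and labels (1 for malicious, 0 for clean) are assumptions, since the paper does not state them.

```python
import random


def split_dataset(samples, test_ratio=0.2, seed=0):
    """Shuffle labeled samples and split them into (training, testing) lists."""
    if not 0.0 < test_ratio < 1.0:
        raise ValueError("test_ratio must be strictly between 0 and 1")
    shuffled = list(samples)               # copy, so the input is not modified
    random.Random(seed).shuffle(shuffled)  # seeded shuffle for reproducibility
    n_test = int(round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical labeled URLs: (url, label), label 1 = malicious, 0 = clean.
labeled = [(f"http://clean{i}.example.org/", 0) for i in range(5)]
labeled += [(f"http://bad{i}.example.net/login.php", 1) for i in range(5)]

train, test = split_dataset(labeled, test_ratio=0.2, seed=0)
```

The training subset is used to fit the classifier and the testing subset to estimate its accuracy before the model is deployed in the detection stage.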
All attributes marked “*” in Tables I, II, and III are newly extracted and selected in this research. In previous research, authors tended to use feature extraction and selection methods based on a group of predefined features. However, those recommended features are specialized and not widely used. As a result, it is usually difficult to implement those features in other works and to re-evaluate their detection performance. In this work, we try to combine basic features to formulate new ones.

C. Machine Learning Algorithm Selection

The application of machine learning algorithms in detecting malicious URLs has been studied and applied widely [1]. In this paper, two commonly used supervised machine learning algorithms, RF and SVM [14, 15], are used.

In this research, machine learning algorithms are the last piece of the puzzle that completes our proposed malicious URL detection system. These algorithms are suitable for exploiting the new features selected for malicious URL detection. The machine learning algorithms themselves are already well investigated in the literature. In this work, SVM and RF are selected as examples to illustrate the good performance of the whole detection system, and are not our main focus. Readers are encouraged to implement other algorithms such as Naïve Bayes, decision trees, k-nearest neighbors, neural networks, etc.

In order to explore the effectiveness of these two algorithms, different parameter settings are evaluated.
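A comparison of RF and SVM under different parameter settings could be set up along the following lines. This is only a sketch under stated assumptions: scikit-learn is used here as a stand-in, since the paper does not name the library behind its experiments; the feature matrix is synthetic random data rather than the paper's URL attributes; and mapping "iterations" to the `max_iter` parameter of `LinearSVC` is our interpretation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in for extracted URL attributes (NOT the paper's data).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # 1 = "malicious", 0 = "clean"

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Parameter settings named after the configurations compared in the paper.
models = {
    "RF (10 trees)": RandomForestClassifier(n_estimators=10, random_state=0),
    "RF (100 trees)": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM (10 iterations)": LinearSVC(max_iter=10),
    "SVM (100 iterations)": LinearSVC(max_iter=100),
}

# Fit each model on the training subset and score it on the testing subset.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = accuracy_score(y_test, model.predict(X_test))
```

With a small `max_iter`, `LinearSVC` may emit a convergence warning; that is expected here, because limiting the iteration count is the point of the comparison.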
Confusion matrix: a two-way table (Table IV) showing how many samples of each real class are assigned to each predicted label.

Precision: the percentage of malicious URLs correctly labeled (TP) among all URLs labeled as malicious by the classifier (TP+FP).

TABLE IV. CONFUSION MATRIX

                       Classified malicious URL   Classified safe URL
Real malicious URLs    TP                         FN
Real safe URLs         FP                         TN
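To make these definitions concrete, the short Python sketch below computes precision from the four confusion-matrix counts, together with recall and accuracy. The recall and accuracy formulas are the standard ones and are stated here as an assumption, since their definitions do not survive in this text. The example counts are taken from Table VI, reading its columns as "predicted safe" followed by "predicted malicious"; the function name `classification_metrics` is hypothetical.

```python
def classification_metrics(tp, fn, fp, tn):
    """Return accuracy, precision and recall from confusion-matrix counts."""
    total = tp + fn + fp + tn
    return {
        # Share of all URLs that were labeled correctly.
        "accuracy": (tp + tn) / total if total else 0.0,
        # Share of URLs flagged as malicious that really are malicious.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Share of real malicious URLs that were flagged.
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Counts read from Table VI: 118 real malicious URLs (109 detected, 9 missed)
# and 107 real safe URLs (96 passed, 11 wrongly flagged).
metrics = classification_metrics(tp=109, fn=9, fp=11, tn=96)
```

On these counts the precision is 109/120, the recall is 109/118, and the accuracy is 205/225, that is, roughly 90.8%, 92.4%, and 91.1%.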
Dataset        Algorithm and parameters   Accuracy (%)   Precision (%)   Recall (%)   Training time (s)   Testing time (s)
10,000 URLs    SVM (100 iterations)       93.39          94.67           92.51        2.32                0.01
10,000 URLs    SVM (10 iterations)        93.35          94.84           92.71        3.11                0.01
10,000 URLs    RF (10 trees)              99.10          98.43           97.45        2.78                0.01
10,000 URLs    RF (100 trees)             99.77          98.75           97.85        3.34                0.01
470,000 URLs   SVM (100 iterations)       90.70          93.43           88.45        272.97              2.12
470,000 URLs   SVM (10 iterations)        91.07          93.75           88.85        280.33              2.31
470,000 URLs   RF (10 trees)              95.45          90.21           95.12        372.97              2.02
470,000 URLs   RF (100 trees)             96.28          91.44           94.42        480.33              2.30
TABLE VI. TESTING RESULTS

                           Predicted safe URL   Predicted malicious URL
Real safe URL (107)        96                   11
Real malicious URL (118)   9                    109

V. CONCLUSIONS

In this paper, a method for malicious URL detection using machine learning is presented. The empirical results in Tables V and VI show the effectiveness of the proposed extracted attributes. In this study, we do not use special attributes, nor do we seek to create huge datasets to improve the accuracy of the system, as many other traditional publications do. Instead, easy-to-calculate attributes are combined with big data processing technologies to balance two factors: the processing time and the accuracy of the system. The results of this research can be applied and implemented in information security technologies and information security systems. The results of this article have been used to build a free tool [20] to detect malicious URLs on web browsers.

REFERENCES

[1] D. Sahoo, C. Liu, and S. C. H. Hoi, “Malicious URL detection using machine learning: A survey,” CoRR, abs/1701.07179, 2017.
[2] M. Khonji, Y. Iraqi, and A. Jones, “Phishing detection: a literature survey,” IEEE Communications Surveys & Tutorials, vol. 15, no. 4, pp. 2091–2121, 2013.
[3] M. Cova, C. Kruegel, and G. Vigna, “Detection and analysis of drive-by-download attacks and malicious javascript code,” in Proceedings of the 19th International Conference on World Wide Web. ACM, 2010, pp. 281–290.
[4] R. Heartfield and G. Loukas, “A taxonomy of attacks and a survey of defence mechanisms for semantic social engineering attacks,” ACM Computing Surveys (CSUR), vol. 48, no. 3, p. 37, 2015.
[5] Internet Security Threat Report (ISTR) 2019, Symantec. https://www.symantec.com/content/dam/symantec/docs/reports/istr-24-2019-en.pdf [Last accessed 10/2019].
[6] S. Sheng, B. Wardman, G. Warner, L. F. Cranor, J. Hong, and C. Zhang, “An empirical analysis of phishing blacklists,” in Proceedings of Sixth Conference on Email and Anti-Spam (CEAS), 2009.
[7] C. Seifert, I. Welch, and P. Komisarczuk, “Identification of malicious web pages with static heuristics,” in Telecommunication Networks and Applications Conference, 2008. ATNAC 2008. Australasian. IEEE, 2008, pp. 91–96.
[8] S. Sinha, M. Bailey, and F. Jahanian, “Shades of grey: On the effectiveness of reputation-based ‘blacklists’,” in Malicious and Unwanted Software, 2008. MALWARE 2008. 3rd International Conference on. IEEE, 2008, pp. 57–64.
[9] J. Ma, L. K. Saul, S. Savage, and G. M. Voelker, “Identifying suspicious URLs: an application of large-scale online learning,” in Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 2009, pp. 681–688.
[10] B. Eshete, A. Villafiorita, and K. Weldemariam, “Binspect: Holistic analysis and detection of malicious web pages,” in Security and Privacy in Communication Networks. Springer, 2013, pp. 149–166.
[11] S. Purkait, “Phishing counter measures and their effectiveness – literature review,” Information Management & Computer Security, vol. 20, no. 5, pp. 382–420, 2012.
[12] Y. Tao, “Suspicious URL and device detection by log mining,” Ph.D. dissertation, Applied Sciences: School of Computing Science, 2014.
[13] G. Canfora, E. Medvet, F. Mercaldo, and C. A. Visaggio, “Detection of malicious web pages using system calls sequences,” in Availability, Reliability, and Security in Information Systems. Springer, 2014, pp. 226–238.
[14] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
[15] T. G. Dietterich, “Ensemble methods in machine learning,” in International Workshop on Multiple Classifier Systems, Cagliari, Italy, 2000, pp. 1–15.
[16] Developer Information. https://www.phishtank.com/developer_info.php. [Last accessed 11/2019].
[17] URLhaus Database Dump. https://urlhaus.abuse.ch/downloads/csv/. [Last accessed 11/2019].
[18] Dataset URL. http://downloads.majestic.com/majestic_million.csv. [Last accessed 10/2019].
[19] Malicious_n_Non-MaliciousURL. https://www.kaggle.com/antonyj453/urldataset#data.csv. [Last accessed 11/2019].
[20] chrome.zip. https://drive.google.com/file/d/13G_Ndr4hMFx_qWyTEjHuOyJmHFWD0Gud/view?fbclid=IwAR0SLVCrvjHHGmoHZH97nXN3Bm-DMY7jG4SOsKZYLAZjTFgeoJADfli64-g. [Last accessed 12/2019].