
MINI PROJECT

(Project Based Learning)

NATURAL LANGUAGE PROCESSING


BACHELOR OF ENGINEERING
Artificial Intelligence And Machine Learning
SUBMITTED BY
Aryan Sahani
Ameya Chhatre
Kapil Mankoskar

Peoples Empowerment Groups


ISBM COLLEGE OF ENGINEERING,
PUNE

Department of AI&ML Engineering


CERTIFICATE
This is to certify that students from Second Year AIML Engineering have successfully
completed their mini project (Project Based Learning) at ISB&M College of Engineering for the degree in
Artificial Intelligence and Machine Learning under Savitribai Phule Pune University.

Group Members:
ARYAN SAHANI
AMEYA CHHATRE
KAPIL MANKOSKAR

(Prof. Kirti Randhe)          (P.K. Srivastava)          (Prof. Kirti Randhe)
(HOD)                         (Principal)                (Internal Guide)
Acknowledgement
I feel great pleasure in expressing my deepest sense of gratitude and sincere
thanks to my guide, Prof. Kirti Randhe, for her valuable guidance during
the project work, without which it would have been a very difficult task.
This acknowledgement would be incomplete without expressing my special
thanks to Prof. Kirti Randhe, Head of the Department (AI&DS),
for her support during the work. I would also like to extend my heartfelt
gratitude to my Principal, Dr. P.K. Shrivastav, who provided a lot of
valuable support, mostly from behind the veils of college bureaucracy.

Finally, I would like to thank all the teaching and non-teaching staff
members of my department, my parents, and my colleagues who helped me
directly or indirectly in completing this project successfully.
ABSTRACT

With the increasing prevalence of online threats and cyber-attacks, there is a pressing need
for effective methods to detect and mitigate malicious links shared on the internet. This
abstract presents a novel approach to address this challenge by leveraging the power of
machine learning and natural language processing (NLP) techniques.
The proposed malicious link detector utilizes a combination of supervised learning
algorithms and advanced NLP models to analyze the textual content and context of links,
aiming to classify them as either malicious or benign. The system operates in real-time,
providing prompt identification and mitigation of potential security risks.
The detection process involves several key stages. Initially, a comprehensive dataset of
labeled examples is collected, consisting of both malicious and benign links. Features are
extracted from these examples, capturing relevant information such as URL structure, domain
reputation, and content characteristics.
Next, a supervised learning algorithm, such as a support vector machine (SVM) or a random
forest classifier, is trained on the labeled dataset. These algorithms learn to differentiate
between malicious and benign links based on the extracted features. To enhance the system's
performance, state-of-the-art NLP models, such as transformer-based architectures like BERT
or GPT, are incorporated to capture semantic relationships and contextual information present
in the link's text.
During the inference phase, the trained model applies the learned patterns and rules to unseen
links, assessing their maliciousness probability. By examining the textual content and
leveraging NLP techniques, the system can identify suspicious patterns, phishing attempts,
malware propagation, or other forms of malicious intent.
To evaluate the effectiveness of the proposed malicious link detector, extensive experiments
are conducted using diverse datasets comprising both known malicious links and legitimate
URLs. The system's performance is measured in terms of metrics such as accuracy, precision,
recall, and F1 score. The results demonstrate the efficacy of the machine learning and
NLP-based approach in accurately identifying malicious links, achieving high detection rates
against potential threats in real time while keeping false positives at a minimum. The
proposed system holds significant promise in strengthening cybersecurity defenses, assisting
internet users, and safeguarding them from cyber threats.

Overall, this abstract outlines a novel approach for detecting malicious links by harnessing
the power of machine learning and NLP techniques. The integration of supervised learning
algorithms and advanced NLP models provides an effective means of identifying and
mitigating security risks, ultimately contributing to enhanced online safety and protecting
users from cyber threats.
Contents

• Introduction
• URL
• Malicious URL
• Problem statement
• Project flow
• Dataset description
• Wordclouds
• Libraries
• Loading dataset
• Feature engineering
• EDA
• Label encoding
• Segregating features & target variable
• Training & testing
• Model building
• Feature importance
• Model prediction
• Conclusion
• References
CHAPTER 1: Introduction

Natural language processing (NLP) is the ability of a computer program to understand
human language as it is spoken and written (referred to as natural language). The
essence of natural language processing lies in making computers understand
natural language. Computers can understand structured forms of data such as
spreadsheets, charts, and tables. Human languages, text, and voice are an
unstructured category of data, which makes them difficult for computers to understand;
from here arises the need for natural language processing. Computers cannot truly
understand human language; rather, they distinguish and try to categorize
various parts of speech based on previously fed data and experience.

The Timeline of NLP Evolution


• 1950s: The first traces of NLP came in the 1950s, when rule-based methods were
used to build NLP systems. These were primarily focused on Machine Translation,
and came into high demand as a result of World War II and the need for effective
translations. Uses included word/sentence analysis, question-answering, and machine
translation.
• 1980s: Computational grammar became an active field of research. Grammar tools
and resources became more available and in demand.
• 1990s: The 1990s saw a booming development of the web and a new interest in
artificial intelligence. This created an abundance of knowledge and drove statistical
learning methods to work on NLP tasks. Statistical learning learns from a specific
dataset and describes its features.
• 2012: Deep learning took over from statistical learning, producing drastic improvements
in NLP systems. Deep learning dives deep into raw data and learns its attributes.
• Current day: There's a huge demand for machines that can talk and understand our
needs, and NLP is the key to that door; just look at products like Alexa and chatbots.
The neural network-based NLP framework (referred to as 'neural NLP') has achieved
new levels of quality and has become the governing approach for NLP. Deep learning
makes tasks such as machine translation (MT), machine reading comprehension (MRC),
chatbots, etc. considerably easier.
The goal of NLP is to enable computers to understand and interpret human language
in a way that is similar to how humans process language. There are five major
challenges that we face in NLP; they are as follows:
1. Training Data
NLP is mainly about studying language, and to be proficient a system must spend a
substantial amount of time listening to, reading, and understanding it. NLP
systems trained on skewed or inaccurate data learn inefficiently and incorrectly.
2. Development Time
The total time taken to develop an NLP system is high, since the AI must evaluate a
huge number of data points to process and use them. Training with GPUs and deep
networks can reduce this to a matter of hours, and pre-existing NLP
technologies can help in developing a product from scratch.
3. Homonyms
Another major challenge for NLP is homonyms, that is, words with multiple
meanings. Humans can interpret the intended meaning of a word with multiple
meanings according to the situation, but for machines this can be difficult to identify.
4. Misspellings
It is not uncommon for humans to make spelling mistakes, which can be difficult to
interpret. The machine needs to detect the intended word properly, and hence it is
essential for NLP technology to recognize and handle misspellings.
5. False Positives
NLP can detect addressable and intelligible words, but false positives and uncertainty
are difficult for it to handle. Developers need to design an NLP
system that can clear up and identify such uncertainty.
CHAPTER 2: What is a URL?
The Uniform Resource Locator (URL) is the well-defined, structured, unique address
format for accessing websites over the World Wide Web (WWW).

Generally, there are three basic components that make up a legitimate URL:

i) Protocol: Basically an identifier that determines which protocol to use, e.g., HTTP, HTTPS, etc.

ii) Hostname: Also known as the resource name. It contains the IP
address or the domain name where the actual resource is located.

iii) Path: It specifies the actual path where the resource is located.

As per the figure, wisdomml.in.edu is the domain name. The top-level
domain is another component of the domain name that tells the nature
of the website, i.e., commercial (.com), educational (.edu), organization
(.org), etc.

Fig. 1. Components of a URL
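
To make these components concrete, here is a minimal sketch (not part of the
original project code) using Python's standard urllib.parse module; the example
URL is hypothetical and used only for illustration.

from urllib.parse import urlparse

# A hypothetical URL used only to illustrate the three components
url = "https://www.example.com/path/to/page.html"
parsed = urlparse(url)

print(parsed.scheme)   # protocol, e.g. 'https'
print(parsed.netloc)   # hostname, e.g. 'www.example.com'
print(parsed.path)     # path, e.g. '/path/to/page.html'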


CHAPTER 3: What is a Malicious URL?
Modified or compromised URLs employed for cyber attacks are
known as malicious URLs.

A malicious URL or website generally contains different types of
trojans, malware, and unsolicited content in the form of phishing,
drive-by downloads, or spam.

The main objective of a malicious website is to defraud unsuspecting users
or steal their personal or financial details. Due to the ongoing
COVID-19 pandemic, incidents of cybercrime have increased manifold.
According to the Symantec Internet Security Threat Report (ISTR)
2019, malicious URLs are a highly used technique in cyber crimes.

In this report, we address the detection of malicious URLs as a
multi-class classification problem by classifying raw URLs into
different class types such as benign or safe URLs, phishing URLs,
malware URLs, or defacement URLs.

Problem statement

In this case study, we address the detection of malicious URLs as a
multi-class classification problem, classifying raw URLs into
different class types such as benign or safe URLs, phishing URLs,
malware URLs, or defacement URLs.

Project flow

As we know, machine learning algorithms only support numeric inputs,
so we will create lexical numeric features from the input URLs. The
input to the machine learning algorithms will therefore be these numeric
lexical features rather than the actual raw URLs. If you are unfamiliar
with lexical features, you can refer to the discussion about lexical
features on StackOverflow.

In this case study, we will be using three well-known machine learning
ensemble classifiers, namely Random Forest, LightGBM, and XGBoost.

Later, we will also compare their performance and plot an averaged
feature importance plot to understand which features are important in
predicting malicious URLs.

Dataset description

In this case study, we will be using a Malicious URLs dataset
of 651,191 URLs, out of which 428,103 are benign or safe URLs,
96,457 are defacement URLs, 94,111 are phishing URLs,
and 32,520 are malware URLs.

Now, let's discuss the different types of URLs in our dataset, i.e., Benign,
Malware, Phishing, and Defacement URLs.

• Benign URLs: These are safe-to-browse URLs. Some
examples of benign URLs are as follows:
  • mp3raid.com/music/krizz_kaliko.html
  • infinitysw.com
  • google.co.in
  • myspace.com
• Malware URLs: These types of URLs inject malware into
the victim's system once he/she visits them. Some
examples of malware URLs are as follows:
  • proplast.co.nz
  • http://103.112.226.142:36308/Mozi.m
  • microencapsulation.readmyweather.com
  • xo3fhvm5lcvzy92q.download
• Defacement URLs: Defacement URLs are generally
created by hackers with the intention of breaking into
a web server and replacing the hosted website with one of
their own, using techniques such as code injection, cross-site
scripting, etc. Common targets of defacement URLs are
religious websites, government websites, bank websites,
and corporate websites. Some examples of defacement URLs
are as follows:
  • http://www.vnic.co/khach-hang.html
  • http://www.raci.it/component/user/reset.html
  • http://www.approvi.com.br/ck.htm
  • http://www.juventudelirica.com.br/index.html
• Phishing URLs: By creating phishing URLs, hackers try to
steal sensitive personal or financial information such as
login credentials, credit card numbers, internet banking
details, etc. Some examples of phishing URLs are
shown below:
  • roverslands.net
  • corporacionrossenditotours.com
  • http://drive-google-com.fanalav.com/6a7ec96d6a
  • citiprepaid-salarysea-at.tk
CHAPTER 4: Wordcloud of URLs

A word cloud helps in understanding the pattern of words/tokens in
particular target labels. It is one of the most appealing techniques of
natural language processing for understanding the pattern of word distribution.

As we can see in the figure below, the word cloud of benign URLs is
fairly obvious, having frequent tokens such as html, com, org, and
wiki. Phishing URLs have frequent tokens such as tools, ietf, www,
index, battle, and net, while html and org also appear with high
frequency, as these URLs try to mimic original URLs to deceive users.

The word cloud of malware URLs has high-frequency tokens such
as exe, E7, BB, and MOZI. These tokens are also obvious, as malware
URLs try to install trojans in the form of executable files on the
user's system once the user visits those URLs.

The defacement URLs' intention is to modify the original website's
code, and this is the reason that tokens in their word cloud are
common web development terms such as index, php, itemid, https,
option, etc.
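
The plotting code for these word cloud figures is not included in the report;
a minimal sketch of how such a figure could be generated with the wordcloud
library (assuming the df and its type column loaded in Chapter 6) is shown below.

import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Join all benign URLs into one text blob and render a word cloud from it
benign_text = " ".join(df[df['type'] == 'benign']['url'])
wc = WordCloud(width=800, height=400, background_color='white').generate(benign_text)

plt.figure(figsize=(10, 5))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()

The same sketch can be repeated with 'phishing', 'malware', or 'defacement' in
place of 'benign' to produce the other three clouds.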
CHAPTER 5: Importing Libraries
In this step, we will import all the necessary Python libraries that
will be used in this project.

import pandas as pd
import numpy as np
import itertools
import os
import matplotlib.pyplot as plt
import seaborn as sns
import xgboost as xgb
from lightgbm import LGBMClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split
from wordcloud import WordCloud

Next, we will load the dataset and check sample records in the
dataset to get an understanding of the data.
CHAPTER 6: Loading dataset
In this step, we will load the dataset using the pandas library and
check the sample entries in the dataset.

df = pd.read_csv('malicious_phish.csv')
print(df.shape)
df.head()

From the above output, we can observe that the dataset
has 651,191 records with two columns: url, containing the raw URLs,
and type, which is the target variable.

Next, we will move on to the feature engineering part, in which we
will create lexical features from the raw URLs.
CHAPTER 7: Feature engineering
In this step, we will extract the following lexical features from the raw
URLs, as these features will be used as the input features for training
the machine learning model. The features are created as follows:

• having_ip_address: Generally, cyber attackers use an IP
address in place of the domain name to hide the identity of
the website. This feature checks whether the URL contains an
IP address or not.
• abnormal_url: This feature can be extracted from
the WHOIS database. For a legitimate website, its identity
(hostname) is typically part of its URL.
• google_index: In this feature, we check whether the URL
is indexed in Google search or not.
• count.: Phishing or malware websites generally use
more than two sub-domains in the URL, each separated by a
dot (.). If any URL contains more than three dots,
the probability that it is a malicious site increases.
• count-www: Generally, most safe websites have exactly
one www in the URL. This feature helps in detecting
malicious websites whose URL has no www or more than
one www.
• count@: The presence of the "@" symbol in a URL causes
the browser to ignore everything preceding it, so it is often
used to hide the real destination.
• count_dir: The presence of multiple directories in the
URL generally indicates a suspicious website.
• count_embed_domain: The number of embedded
domains can be helpful in detecting malicious URLs. It
can be computed by checking the occurrence of "//" in the
URL path.
• Suspicious words in URL: Malicious URLs generally
contain suspicious words such as PayPal, login,
signin, bank, account, update, bonus, service,
ebayisapi, token, etc. We encode the presence of such
frequently occurring suspicious words in the URL as a
binary variable, i.e., whether such words are present in the
URL or not.
• short_url: This feature identifies whether the URL uses a
URL shortening service such as bit.ly, goo.gl, go2l.ink, etc.
• count_https: Generally, malicious URLs do not use the
HTTPS protocol, as it generally requires user credentials
and ensures that the website is safe for transactions. So,
the presence or absence of HTTPS in the URL is an
important feature.
• count_http: Most of the time, phishing or malicious URLs
contain more than one occurrence of http, whereas safe
URLs contain only one.
• count%: As we know, URLs cannot contain spaces; URL
encoding replaces spaces with the % symbol. Safe
sites generally contain fewer spaces, whereas malicious
websites generally contain more spaces in their URLs and
hence more occurrences of %.
• count?: The presence of the ? symbol in a URL denotes a
query string that contains data to be passed to the
server. A higher number of ? symbols in a URL indicates a
suspicious URL.
• count-: Phishers and cybercriminals generally add dashes (-)
as a prefix or suffix of a brand name so that the URL looks
genuine, for example www.flipkart-india.com.
• count=: The presence of the = symbol in a URL indicates the
passing of variable values from one form page to another. It
is considered riskier, as anyone can change the
values to modify the page.
• url_length: Attackers generally use long URLs to hide the
domain name. We found the average length of a safe URL
to be 74 characters.
• hostname_length: The length of the hostname is also an
important feature for detecting malicious URLs.
• First directory length: This feature gives the
length of the first directory in the URL. Looking for
the first '/' in the path and counting the length of the URL up
to the next '/' gives the first directory length of the URL.
For accessing directory-level information, we need to
install the Python library tld.
• Length of top-level domain: A top-level domain (TLD)
is one of the domains at the highest level in the
hierarchical Domain Name System of the Internet. For
example, in the domain name www.example.com, the
top-level domain is com. The length of the TLD is also
important in identifying malicious URLs; as most safe
URLs have a .com extension, TLD lengths in the range of 2 to 3
generally indicate safe URLs.
• count_digits: The presence of digits in a URL generally
indicates a suspicious URL. Safe URLs generally do not
have many digits, so counting the number of digits in a URL is an
important feature for detecting malicious URLs.
• count_letters: The number of letters in the URL also
plays a significant role in identifying malicious URLs, as
attackers try to increase the length of the URL to hide the
domain name, generally by increasing the number of
letters and digits in the URL.
The code for creating the above-mentioned features is shared below.

import re
from urllib.parse import urlparse

# Use of IP address or not in the domain
def having_ip_address(url):
    match = re.search(
        r'(([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.'
        r'([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\/)|'  # IPv4
        r'((0x[0-9a-fA-F]{1,2})\.(0x[0-9a-fA-F]{1,2})\.(0x[0-9a-fA-F]{1,2})\.'
        r'(0x[0-9a-fA-F]{1,2})\/)|'  # IPv4 in hexadecimal
        r'(?:[a-fA-F0-9]{1,4}:){7}[a-fA-F0-9]{1,4}', url)  # IPv6
    return 1 if match else 0

df['use_of_ip'] = df['url'].apply(lambda i: having_ip_address(i))

# Abnormal URL: check whether the hostname appears in the URL string
def abnormal_url(url):
    hostname = str(urlparse(url).hostname)
    match = re.search(hostname, url)
    return 1 if match else 0

df['abnormal_url'] = df['url'].apply(lambda i: abnormal_url(i))

# pip install googlesearch-python
from googlesearch import search

def google_index(url):
    site = search(url, 5)
    return 1 if site else 0

df['google_index'] = df['url'].apply(lambda i: google_index(i))

# Count of '.' characters
def count_dot(url):
    return url.count('.')

df['count.'] = df['url'].apply(lambda i: count_dot(i))

# Count of 'www' occurrences
def count_www(url):
    return url.count('www')

df['count-www'] = df['url'].apply(lambda i: count_www(i))

# Count of '@' characters
def count_atrate(url):
    return url.count('@')

df['count@'] = df['url'].apply(lambda i: count_atrate(i))

# Number of directories in the URL path
def no_of_dir(url):
    urldir = urlparse(url).path
    return urldir.count('/')

df['count_dir'] = df['url'].apply(lambda i: no_of_dir(i))

# Number of embedded domains ('//' in the path)
def no_of_embed(url):
    urldir = urlparse(url).path
    return urldir.count('//')

df['count_embed_domian'] = df['url'].apply(lambda i: no_of_embed(i))

# Use of a URL shortening service
def shortening_service(url):
    match = re.search(
        r'bit\.ly|goo\.gl|shorte\.st|go2l\.ink|x\.co|ow\.ly|t\.co|tinyurl|tr\.im|is\.gd|cli\.gs|'
        r'yfrog\.com|migre\.me|ff\.im|tiny\.cc|url4\.eu|twit\.ac|su\.pr|twurl\.nl|snipurl\.com|'
        r'short\.to|BudURL\.com|ping\.fm|post\.ly|Just\.as|bkite\.com|snipr\.com|fic\.kr|loopt\.us|'
        r'doiop\.com|short\.ie|kl\.am|wp\.me|rubyurl\.com|om\.ly|to\.ly|bit\.do|t\.co|lnkd\.in|'
        r'db\.tt|qr\.ae|adf\.ly|goo\.gl|bitly\.com|cur\.lv|tinyurl\.com|ow\.ly|bit\.ly|ity\.im|'
        r'q\.gs|is\.gd|po\.st|bc\.vc|twitthis\.com|u\.to|j\.mp|buzurl\.com|cutt\.us|u\.bb|yourls\.org|'
        r'x\.co|prettylinkpro\.com|scrnch\.me|filoops\.info|vzturl\.com|qr\.net|1url\.com|tweez\.me|v\.gd|'
        r'tr\.im|link\.zip\.net',
        url)
    return 1 if match else 0

df['short_url'] = df['url'].apply(lambda i: shortening_service(i))

# Counts of 'https', 'http', '%', '?', '-', and '='
def count_https(url):
    return url.count('https')

df['count-https'] = df['url'].apply(lambda i: count_https(i))

def count_http(url):
    return url.count('http')

df['count-http'] = df['url'].apply(lambda i: count_http(i))

def count_per(url):
    return url.count('%')

df['count%'] = df['url'].apply(lambda i: count_per(i))

def count_ques(url):
    return url.count('?')

df['count?'] = df['url'].apply(lambda i: count_ques(i))

def count_hyphen(url):
    return url.count('-')

df['count-'] = df['url'].apply(lambda i: count_hyphen(i))

def count_equal(url):
    return url.count('=')

df['count='] = df['url'].apply(lambda i: count_equal(i))

# Length of the full URL
def url_length(url):
    return len(str(url))

df['url_length'] = df['url'].apply(lambda i: url_length(i))

# Length of the hostname
def hostname_length(url):
    return len(urlparse(url).netloc)

df['hostname_length'] = df['url'].apply(lambda i: hostname_length(i))

df.head()

# Presence of suspicious words
def suspicious_words(url):
    match = re.search(
        'PayPal|login|signin|bank|account|update|free|lucky|service|bonus|ebayisapi|webscr',
        url)
    return 1 if match else 0

df['sus_url'] = df['url'].apply(lambda i: suspicious_words(i))

# Number of digits
def digit_count(url):
    digits = 0
    for i in url:
        if i.isnumeric():
            digits = digits + 1
    return digits

df['count-digits'] = df['url'].apply(lambda i: digit_count(i))

# Number of letters
def letter_count(url):
    letters = 0
    for i in url:
        if i.isalpha():
            letters = letters + 1
    return letters

df['count-letters'] = df['url'].apply(lambda i: letter_count(i))

# pip install tld
from tld import get_tld

# First directory length
def fd_length(url):
    urlpath = urlparse(url).path
    try:
        return len(urlpath.split('/')[1])
    except:
        return 0

df['fd_length'] = df['url'].apply(lambda i: fd_length(i))

# Length of top-level domain
df['tld'] = df['url'].apply(lambda i: get_tld(i, fail_silently=True))

def tld_length(tld):
    try:
        return len(tld)
    except:
        return -1

df['tld_length'] = df['tld'].apply(lambda i: tld_length(i))
So, after creating the above 22 features, the dataset looks like the
output below. Now, in the next step, we drop the irrelevant columns,
i.e., url, google_index, and tld, as sketched after this explanation.

The reason for dropping the url column is that we have already
extracted the relevant features from it that can be used as input to the
machine learning algorithms. The tld column is dropped because it is an
intermediate textual column, created only for finding the length of the
top-level domain.

The google_index feature denotes whether the URL is indexed in Google
search or not. In this dataset, all the URLs are Google
indexed and have a value of 1.
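
The dropping step itself is not shown in the report; a minimal sketch of it,
assuming the column names created above, would be:

# Drop the raw URL, the constant google_index feature, and the intermediate tld column
df = df.drop(['url', 'google_index', 'tld'], axis=1)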
CHAPTER 8: Exploratory Data Analysis (EDA)
In this step, we will check the distribution of different features across all
four classes of URLs.

As we can observe from the distribution of the use_of_ip feature, only
malware URLs tend to have IP addresses. In the case of abnormal_url,
defacement URLs have a higher distribution.

From the distribution of sus_url, it is clear that benign URLs have the
highest distribution, while phishing URLs have the second highest. Since
the suspicious-word feature consists of transaction- and payment-related
keywords, and genuine banking or payment-related URLs also contain such
keywords, benign URLs end up with the highest distribution.

As per the short_url distribution, we can observe that benign URLs
have the most shortened URLs, as we generally use URL shortening
services to easily share long URLs.
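
The distribution figures themselves are not reproduced here; a minimal sketch of
one way to plot such per-class comparisons with seaborn (using the feature
columns created earlier) might be:

import seaborn as sns
import matplotlib.pyplot as plt

# Bar height is the mean of each binary feature per URL class
for feature in ['use_of_ip', 'abnormal_url', 'sus_url', 'short_url']:
    sns.barplot(x='type', y=feature, data=df)
    plt.title('Distribution of ' + feature + ' by URL type')
    plt.show()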
CHAPTER 9: Label Encoding

After that, an important step is to label encode the target
variable (type) so that it is converted into the numerical categories
0, 1, 2, and 3, as machine learning algorithms only understand a numeric
target variable.

from sklearn.preprocessing import LabelEncoder

lb_make = LabelEncoder()
df["type_code"] = lb_make.fit_transform(df["type"])
Segregating Feature and Target Variables

In the next step, we create the predictor and target variables. Here the
predictor variables are the independent variables, i.e., the lexical features
of the URL, and the target variable is type_code.

#Predictor Variables
# filtering out google_index as it has only 1 value
X = df[['use_of_ip', 'abnormal_url', 'count.', 'count-www', 'count@',
        'count_dir', 'count_embed_domian', 'short_url', 'count-https',
        'count-http', 'count%', 'count?', 'count-', 'count=', 'url_length',
        'hostname_length', 'sus_url', 'fd_length', 'tld_length', 'count-digits',
        'count-letters']]
#Target Variable
y = df['type_code']
CHAPTER10: Training & Test Split

The next step is to split the dataset into train and test sets. We have
split the dataset into 80:20 ratio i.e., 80% of the data was used to train
the machine learning models, and the rest 20% was used to test the
model.

As we know we have an imbalanced dataset. The reason for this is


around 66% of the data has benign
URLs, 5% malware, 14% phishing, and 15% defacement URLs. So
after randomly splitting the dataset into train and test, it may happen
that the distribution of different categories got disturbed which will
highly affect the performance of the machine learning model. So to
maintain the same proportion of the target variable stratification is
needed.

This stratify parameter makes a split so that the proportion of values


in the sample produced will be the same as the proportion of values
provided to the parameter stratify.

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,


test_size=0.2,shuffle=True, random_state=5)
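
To confirm that stratification preserved the class proportions, a quick sanity
check can be added:

# Class proportions should be (approximately) identical across the full set, train, and test
print(y.value_counts(normalize=True))
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))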
So, now we are ready for the most awaited part, which is model building!
CHAPTER 11: Model building
In this step, we will build three tree-based ensemble machine learning
models, i.e., LightGBM, XGBoost, and Random Forest.

The code for building the machine learning models is shared below. Note that
the class names passed to target_names follow the alphabetical order produced
by LabelEncoder (benign, defacement, malware, phishing).

# Random Forest Model
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, max_features='sqrt')
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
print(classification_report(y_test, y_pred_rf,
                            target_names=['benign', 'defacement', 'malware', 'phishing']))
score = accuracy_score(y_test, y_pred_rf)
print("accuracy: %0.3f" % score)

# XGBoost Classifier
xgb_c = xgb.XGBClassifier(n_estimators=100)
xgb_c.fit(X_train, y_train)
y_pred_x = xgb_c.predict(X_test)
print(classification_report(y_test, y_pred_x,
                            target_names=['benign', 'defacement', 'malware', 'phishing']))
score = accuracy_score(y_test, y_pred_x)
print("accuracy: %0.3f" % score)

# LightGBM Classifier
lgb = LGBMClassifier(objective='multiclass', boosting_type='gbdt',
                     n_jobs=5, silent=True, random_state=5)
LGB_C = lgb.fit(X_train, y_train)
y_pred_lgb = LGB_C.predict(X_test)
print(classification_report(y_test, y_pred_lgb,
                            target_names=['benign', 'defacement', 'malware', 'phishing']))
score = accuracy_score(y_test, y_pred_lgb)
print("accuracy: %0.3f" % score)
Model evaluation & comparison
After fitting the model, as shown above, we have made predictions on
the test set. The performance of Light GBM, XGBoost, and Random
Forest are shown below.

From the above result, it is evident that Random Forest shows the
best performance in terms of test accuracy as it attains the highest
accuracy of 96.6% with a higher detection rate for benign,
defacement, phishing, and malware.
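
For a compact side-by-side view of the three accuracies, a short comparison
loop over the predictions computed above can be added:

# Summarize test accuracy for the three fitted models
for name, preds in [('Random Forest', y_pred_rf),
                    ('XGBoost', y_pred_x),
                    ('LightGBM', y_pred_lgb)]:
    print(name, ': %0.3f' % accuracy_score(y_test, preds))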

So, based on the above performance, we have selected Random Forest
as our main model for detecting malicious URLs, and in the next step
we will also plot its feature importance.
CHAPTER 12: Feature Importance
After selecting our model, i.e., Random Forest, we next check its most
highly contributing features. The code for plotting the feature
importance plot is given below:

feat_importances = pd.Series(rf.feature_importances_, index=X_train.columns)
feat_importances.sort_values().plot(kind="barh", figsize=(10, 6))
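
To read the top features off numerically rather than from the plot, one more
line can be added:

# Print the five most important features by impurity-based importance
print(feat_importances.sort_values(ascending=False).head(5))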

From the above plot, we can observe that the top 5 features for
detecting malicious URLs are hostname_length, count_dir, count-www,
fd_length, and url_length.
CHAPTER 13: Model prediction

In this final step, we will predict malicious URLs using our best-performing
model, i.e., Random Forest.

The code for predicting raw URLs using our selected model is given below.
Note that the features must be appended in the same order as the columns
of X used for training, and that the numeric predictions map back to class
names in the alphabetical order learned by LabelEncoder (0 = benign,
1 = defacement, 2 = malware, 3 = phishing).

def main(url):
    # Features must follow the same order as the training columns in X
    status = []
    status.append(having_ip_address(url))
    status.append(abnormal_url(url))
    status.append(count_dot(url))
    status.append(count_www(url))
    status.append(count_atrate(url))
    status.append(no_of_dir(url))
    status.append(no_of_embed(url))
    status.append(shortening_service(url))
    status.append(count_https(url))
    status.append(count_http(url))
    status.append(count_per(url))
    status.append(count_ques(url))
    status.append(count_hyphen(url))
    status.append(count_equal(url))
    status.append(url_length(url))
    status.append(hostname_length(url))
    status.append(suspicious_words(url))
    status.append(fd_length(url))
    tld = get_tld(url, fail_silently=True)
    status.append(tld_length(tld))
    status.append(digit_count(url))
    status.append(letter_count(url))
    return status

# predict function
def get_prediction_from_url(test_url):
    features_test = main(test_url)
    # Due to updates to scikit-learn, we now need a 2D array as a
    # parameter to the predict function.
    features_test = np.array(features_test).reshape((1, -1))
    pred = rf.predict(features_test)
    # Codes follow LabelEncoder's alphabetical ordering of class names
    if int(pred[0]) == 0:
        return "SAFE"
    elif int(pred[0]) == 1:
        return "DEFACEMENT"
    elif int(pred[0]) == 2:
        return "MALWARE"
    elif int(pred[0]) == 3:
        return "PHISHING"

# predicting sample raw URLs
urls = ['titaniumcorporate.co.za', 'en.wikipedia.org/wiki/North_Dakota']
for url in urls:
    print(get_prediction_from_url(url))
CHAPTER 14: Conclusion

In this report, we have demonstrated a machine learning approach to
detect malicious URLs. We created 22 lexical features from raw
URLs and trained three machine learning models: XGBoost, LightGBM,
and Random Forest. We then compared the performance of the three
models and found that Random Forest outperformed the others, attaining
the highest accuracy of 96.6%. By plotting the feature importance of the
Random Forest model, we found that hostname_length, count_dir, count-www,
fd_length, and url_length are the top 5 features for detecting
malicious URLs. Finally, we coded a prediction function for classifying
any raw URL using our selected model, i.e., Random Forest.
CHAPTER 15: References

• "Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition" by Daniel Jurafsky and James H. Martin.
• "Foundations of Statistical Natural Language Processing" by Christopher D. Manning and Hinrich Schütze.
• Towards Data Science (https://towardsdatascience.com): A platform with numerous NLP articles and tutorials.
• Natural Language Processing with Python (https://www.nltk.org/book/): An online book introducing NLP using Python's NLTK library.
• Association for Computational Linguistics (ACL): One of the premier conferences in NLP.
• Conference on Empirical Methods in Natural Language Processing (EMNLP): A leading conference in NLP research.