
MINI PROJECT

(Project Based Learning)

NATURAL LANGUAGE PROCESSING


BACHELOR OF ENGINEERING
Artificial Intelligence And Machine Learning
SUBMITTED BY
Aryan Sahani
Ameya Chhatre
Kapil Mankoskar

Peoples Empowerment Groups


ISBM COLLEGE OF ENGINEERING,
PUNE

Department of AI&ML Engineering


CERTIFICATE
This is to certify that students from Second Year AIML Engineering have successfully
completed their mini project (Project Based Learning) at ISB&M College of Engineering for the degree in
Artificial Intelligence and Machine Learning under Savitribai Phule Pune University.

Group Members:
ARYAN SAHANI
AMEYA CHHATRE
KAPIL MANKOSKAR

(Prof. Kirti Randhe)          (P.K. Srivastava)          (Prof. Kirti Randhe)
(HOD)                         (Principal)                (Internal Guide)
Acknowledgement
I feel great pleasure in expressing my deepest sense of gratitude and sincere
thanks to my guide, Prof. Kirti Randhe, for her valuable guidance during
the project work, without which it would have been a very difficult task.
This acknowledgement would be incomplete without expressing my special
thanks to Prof. Kirti Randhe, Head of the Department (AI&DS),
for her support during the work. I would also like to extend my heartfelt
gratitude to my Principal, Dr. P.K. Shrivastav, who provided a lot of
valuable support, mostly from behind the veils of college bureaucracy.

Finally, I would like to thank all the teaching and non-teaching staff
members of my department, my parents, and my colleagues who helped me
directly or indirectly in completing this project successfully.
ABSTRACT

With the increasing prevalence of online threats and cyber-attacks, there is a pressing need
for effective methods to detect and mitigate malicious links shared on the internet. This
abstract presents a novel approach to address this challenge by leveraging the power of
machine learning and natural language processing (NLP) techniques.
The proposed malicious link detector utilizes a combination of supervised learning
algorithms and advanced NLP models to analyze the textual content and context of links,
aiming to classify them as either malicious or benign. The system operates in real-time,
providing prompt identification and mitigation of potential security risks.
The detection process involves several key stages. Initially, a comprehensive dataset of
labeled examples is collected, consisting of both malicious and benign links. Features are
extracted from these examples, capturing relevant information such as URL structure, domain
reputation, and content characteristics.
Next, a supervised learning algorithm, such as a support vector machine (SVM) or a random
forest classifier, is trained on the labeled dataset. These algorithms learn to differentiate
between malicious and benign links based on the extracted features. To enhance the system's
performance, state-of-the-art NLP models, such as transformer-based architectures like BERT
or GPT, are incorporated to capture semantic relationships and contextual information present
in the link's text.
During the inference phase, the trained model applies the learned patterns and rules to unseen
links, assessing their maliciousness probability. By examining the textual content and
leveraging NLP techniques, the system can identify suspicious patterns, phishing attempts,
malware propagation, or other forms of malicious intent.
To evaluate the effectiveness of the proposed malicious link detector, extensive experiments
are conducted using diverse datasets comprising both known malicious links and legitimate
URLs. The system's performance is measured in terms of metrics such as accuracy, precision,
recall, and F1 score. The results demonstrate the efficacy of the machine learning and
NLP-based approach in accurately identifying malicious links, achieving high detection rates
against potential threats in real time while keeping false positives at a minimum. The
proposed system holds significant promise in strengthening cybersecurity defenses, assisting
internet users, and safeguarding them from cyber threats.

Overall, this abstract outlines a novel approach for detecting malicious links by harnessing
the power of machine learning and NLP techniques. The integration of supervised learning
algorithms and advanced NLP models provides an effective means of identifying and
mitigating security risks, ultimately contributing to enhanced online safety and protecting
users from cyber threats.
Contents

• Introduction
• URL
• Malicious URL
• Problem statement
• Project flow
• Dataset description
• Wordclouds
• Libraries
• Loading dataset
• Feature engineering
• EDA
• Label encoding
• Segregating features & target variable
• Training & testing
• Model building
• Feature importance
• Model prediction
• Conclusion
• References
CHAPTER 1: Introduction

Natural language processing (NLP) is the ability of a computer program to understand
human language as it is spoken and written (referred to as natural language). The
essence of natural language processing lies in making computers understand
natural language. Computers can understand structured forms of data such as
spreadsheets, charts, and tables. Human languages, text, and voice are an
unstructured category of data, which makes them difficult for computers to understand;
from here arises the need for natural language processing. Computers cannot truly
understand human language; rather, they distinguish and try to categorize
various parts of speech based on previously fed data and experience.

The Timeline of NLP Evolution


• 1950s: The first traces of NLP came in the 1950s, when rule-based methods were
used to build NLP systems. These were primarily focused on Machine Translation,
and came into high demand as a result of World War II and the need for effective
translations. Uses included word/sentence analysis, question-answering, and machine
translation.
• 1980s: Computational grammar became an active field of research. Grammar tools
and resources became more available and in demand.
• 1990s: The 1990s saw a booming development of the web and a new interest in
artificial intelligence. This created an abundance of knowledge and drove statistical
learning methods to work on NLP tasks. Statistical learning learns from a specific
dataset and describes its features.
• 2012: Deep learning took over from statistical learning, producing drastic improvements
in NLP systems. Deep learning dives deep into raw data and learns its attributes.
• Current day: There's a huge demand for machines that can talk and understand our
needs, and NLP is the key to that door; just look at products like Alexa and chatbots.
The neural network-based NLP framework (referred to as 'neural NLP') has achieved
new levels of quality and has become the governing approach for NLP. Deep learning
makes tasks such as machine translation (MT), machine reading comprehension (MRC),
chatbots, etc. considerably easier.
The goal of NLP is to enable computers to understand and interpret human language
in a way that is similar to how humans process language. There are five major
challenges that we face in NLP; they are as follows:
1. Training Data
NLP is mainly about studying language, and to be proficient a system must spend a
substantial amount of time listening to, reading, and understanding it. NLP
systems trained on skewed or inaccurate data learn inefficiently and incorrectly.
2. Development Time
The total time taken to develop an NLP system is high, since the AI must evaluate a
huge number of data points to process and use them. Training with GPUs and deep
networks can reduce this to a matter of hours, and pre-existing NLP
technologies can help in developing a product from scratch.
3. Homonyms
Another major challenge for NLP is homonyms, that is, words with multiple
meanings. Humans can interpret the intended meaning of a word with multiple
meanings according to the situation, but for machines this can be difficult to identify.
4. Misspellings
It is not uncommon for humans to make spelling mistakes, which can be difficult to
interpret. The machine needs to detect the intended word properly, and hence it is
essential for NLP technology to recognize and handle misspellings.
5. False Positives
NLP can detect addressable and intelligible words, but false positives and uncertainty
are difficult for it to handle. Developers need to design an NLP
system that can clear up and identify such uncertainty.
CHAPTER 2: What is a URL?
The Uniform Resource Locator (URL) is the well-defined, structured, unique address
format for accessing websites over the World Wide Web (WWW).

Generally, there are three basic components that make up a legitimate URL:

i) Protocol: Basically an identifier that determines which protocol to use, e.g., HTTP, HTTPS, etc.

ii) Hostname: Also known as the resource name. It contains the IP
address or the domain name where the actual resource is located.

iii) Path: It specifies the actual path where the resource is located.

As per the figure, wisdomml.in.edu is the domain name. The top-level
domain is another component of the domain name that tells the nature
of the website, i.e., commercial (.com), educational (.edu), organization
(.org), etc.

Fig. 1. Components of a URL
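
To make these components concrete, here is a minimal sketch (not part of the
original project code) using Python's standard urllib.parse module; the example
URL is hypothetical and used only for illustration.

from urllib.parse import urlparse

# A hypothetical URL used only to illustrate the three components
url = "https://www.example.com/path/to/page.html"
parsed = urlparse(url)

print(parsed.scheme)   # protocol, e.g. 'https'
print(parsed.netloc)   # hostname, e.g. 'www.example.com'
print(parsed.path)     # path, e.g. '/path/to/page.html'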


CHAPTER 3: What is a Malicious URL?
Modified or compromised URLs employed for cyber attacks are
known as malicious URLs.

A malicious URL or website generally contains different types of
trojans, malware, and unsolicited content in the form of phishing,
drive-by downloads, or spam.

The main objective of a malicious website is to defraud unsuspecting users
or steal their personal or financial details. Due to the ongoing
COVID-19 pandemic, incidents of cybercrime have increased manifold.
According to the Symantec Internet Security Threat Report (ISTR)
2019, malicious URLs are a highly used technique in cyber crimes.

In this report, we address the detection of malicious URLs as a
multi-class classification problem by classifying raw URLs into
different class types such as benign or safe URLs, phishing URLs,
malware URLs, or defacement URLs.

Problem statement

In this case study, we address the detection of malicious URLs as a
multi-class classification problem, classifying raw URLs into
different class types such as benign or safe URLs, phishing URLs,
malware URLs, or defacement URLs.

Project flow

As we know, machine learning algorithms only support numeric inputs,
so we will create lexical numeric features from the input URLs. The
input to the machine learning algorithms will therefore be these numeric
lexical features rather than the actual raw URLs. If you are unfamiliar
with lexical features, you can refer to the discussion about lexical
features on StackOverflow.

In this case study, we will be using three well-known machine learning
ensemble classifiers, namely Random Forest, LightGBM, and XGBoost.

Later, we will also compare their performance and plot an averaged
feature importance plot to understand which features are important in
predicting malicious URLs.

Dataset description

In this case study, we will be using a Malicious URLs dataset
of 651,191 URLs, out of which 428,103 are benign or safe URLs,
96,457 are defacement URLs, 94,111 are phishing URLs,
and 32,520 are malware URLs.

Now, let's discuss the different types of URLs in our dataset, i.e., Benign,
Malware, Phishing, and Defacement URLs.

• Benign URLs: These are safe-to-browse URLs. Some
examples of benign URLs are as follows:
  • mp3raid.com/music/krizz_kaliko.html
  • infinitysw.com
  • google.co.in
  • myspace.com
• Malware URLs: These types of URLs inject malware into
the victim's system once he/she visits them. Some
examples of malware URLs are as follows:
  • proplast.co.nz
  • http://103.112.226.142:36308/Mozi.m
  • microencapsulation.readmyweather.com
  • xo3fhvm5lcvzy92q.download
• Defacement URLs: Defacement URLs are generally
created by hackers with the intention of breaking into
a web server and replacing the hosted website with one of
their own, using techniques such as code injection, cross-site
scripting, etc. Common targets of defacement URLs are
religious websites, government websites, bank websites,
and corporate websites. Some examples of defacement URLs
are as follows:
  • http://www.vnic.co/khach-hang.html
  • http://www.raci.it/component/user/reset.html
  • http://www.approvi.com.br/ck.htm
  • http://www.juventudelirica.com.br/index.html
• Phishing URLs: By creating phishing URLs, hackers try to
steal sensitive personal or financial information such as
login credentials, credit card numbers, internet banking
details, etc. Some examples of phishing URLs are
shown below:
  • roverslands.net
  • corporacionrossenditotours.com
  • http://drive-google-com.fanalav.com/6a7ec96d6a
  • citiprepaid-salarysea-at.tk
CHAPTER 4: Wordcloud of URLs

A word cloud helps in understanding the pattern of words/tokens in
particular target labels. It is one of the most appealing techniques of
natural language processing for understanding the pattern of word distribution.

As we can see in the figure below, the word cloud of benign URLs is
fairly obvious, having frequent tokens such as html, com, org, and
wiki. Phishing URLs have frequent tokens such as tools, ietf, www,
index, battle, and net, while html and org also appear with high
frequency, as these URLs try to mimic original URLs to deceive users.

The word cloud of malware URLs has high-frequency tokens such
as exe, E7, BB, and MOZI. These tokens are also obvious, as malware
URLs try to install trojans in the form of executable files on the
user's system once the user visits those URLs.

The defacement URLs' intention is to modify the original website's
code, and this is the reason that tokens in their word cloud are
common web development terms such as index, php, itemid, https,
option, etc.
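
The plotting code for these word cloud figures is not included in the report;
a minimal sketch of how such a figure could be generated with the wordcloud
library (assuming the df and its type column loaded in Chapter 6) is shown below.

import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Join all benign URLs into one text blob and render a word cloud from it
benign_text = " ".join(df[df['type'] == 'benign']['url'])
wc = WordCloud(width=800, height=400, background_color='white').generate(benign_text)

plt.figure(figsize=(10, 5))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()

The same sketch can be repeated with 'phishing', 'malware', or 'defacement' in
place of 'benign' to produce the other three clouds.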
CHAPTER 5: Importing Libraries
In this step, we will import all the necessary Python libraries that
will be used in this project.

import pandas as pd
import numpy as np
import itertools
import os
import matplotlib.pyplot as plt
import seaborn as sns
import xgboost as xgb
from lightgbm import LGBMClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split
from wordcloud import WordCloud

Next, we will load the dataset and check sample records in the
dataset to get an understanding of the data.
CHAPTER 6: Loading dataset
In this step, we will load the dataset using the pandas library and
check the sample entries in the dataset.

df = pd.read_csv('malicious_phish.csv')
print(df.shape)
df.head()

From the above output, we can observe that the dataset
has 651,191 records with two columns: url, containing the raw URLs,
and type, which is the target variable.

Next, we will move on to the feature engineering part, in which we
will create lexical features from the raw URLs.
CHAPTER 7: Feature engineering
In this step, we will extract the following lexical features from the raw
URLs, as these features will be used as the input features for training
the machine learning model. The features are created as follows:

• having_ip_address: Generally, cyber attackers use an IP
address in place of the domain name to hide the identity of
the website. This feature checks whether the URL contains an
IP address or not.
• abnormal_url: This feature can be extracted from
the WHOIS database. For a legitimate website, its identity
(hostname) is typically part of its URL.
• google_index: In this feature, we check whether the URL
is indexed in Google search or not.
• count.: Phishing or malware websites generally use
more than two sub-domains in the URL, each separated by a
dot (.). If any URL contains more than three dots,
the probability that it is a malicious site increases.
• count-www: Generally, most safe websites have exactly
one www in the URL. This feature helps in detecting
malicious websites whose URL has no www or more than
one www.
• count@: The presence of the "@" symbol in a URL causes
the browser to ignore everything preceding it, so it is often
used to hide the real destination.
• count_dir: The presence of multiple directories in the
URL generally indicates a suspicious website.
• count_embed_domain: The number of embedded
domains can be helpful in detecting malicious URLs. It
can be computed by checking the occurrence of "//" in the
URL path.
• Suspicious words in URL: Malicious URLs generally
contain suspicious words such as PayPal, login,
signin, bank, account, update, bonus, service,
ebayisapi, token, etc. We encode the presence of such
frequently occurring suspicious words in the URL as a
binary variable, i.e., whether such words are present in the
URL or not.
• short_url: This feature identifies whether the URL uses a
URL shortening service such as bit.ly, goo.gl, go2l.ink, etc.
• count_https: Generally, malicious URLs do not use the
HTTPS protocol, as it generally requires user credentials
and ensures that the website is safe for transactions. So,
the presence or absence of HTTPS in the URL is an
important feature.
• count_http: Most of the time, phishing or malicious URLs
contain more than one occurrence of http, whereas safe
URLs contain only one.
• count%: As we know, URLs cannot contain spaces; URL
encoding replaces spaces with the % symbol. Safe
sites generally contain fewer spaces, whereas malicious
websites generally contain more spaces in their URLs and
hence more occurrences of %.
• count?: The presence of the ? symbol in a URL denotes a
query string that contains data to be passed to the
server. A higher number of ? symbols in a URL indicates a
suspicious URL.
• count-: Phishers and cybercriminals generally add dashes (-)
as a prefix or suffix of a brand name so that the URL looks
genuine, for example www.flipkart-india.com.
• count=: The presence of the = symbol in a URL indicates the
passing of variable values from one form page to another. It
is considered riskier, as anyone can change the
values to modify the page.
• url_length: Attackers generally use long URLs to hide the
domain name. We found the average length of a safe URL
to be 74 characters.
• hostname_length: The length of the hostname is also an
important feature for detecting malicious URLs.
• First directory length: This feature gives the
length of the first directory in the URL. Looking for
the first '/' in the path and counting the length of the URL up
to the next '/' gives the first directory length of the URL.
For accessing directory-level information, we need to
install the Python library tld.
• Length of top-level domain: A top-level domain (TLD)
is one of the domains at the highest level in the
hierarchical Domain Name System of the Internet. For
example, in the domain name www.example.com, the
top-level domain is com. The length of the TLD is also
important in identifying malicious URLs; as most safe
URLs have a .com extension, TLD lengths in the range of 2 to 3
generally indicate safe URLs.
• count_digits: The presence of digits in a URL generally
indicates a suspicious URL. Safe URLs generally do not
have many digits, so counting the number of digits in a URL is an
important feature for detecting malicious URLs.
• count_letters: The number of letters in the URL also
plays a significant role in identifying malicious URLs, as
attackers try to increase the length of the URL to hide the
domain name, generally by increasing the number of
letters and digits in the URL.
The code for creating the above-mentioned features is shared below.

import re
from urllib.parse import urlparse

# Use of IP address or not in the domain
def having_ip_address(url):
    match = re.search(
        r'(([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.'
        r'([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\/)|'  # IPv4
        r'((0x[0-9a-fA-F]{1,2})\.(0x[0-9a-fA-F]{1,2})\.(0x[0-9a-fA-F]{1,2})\.'
        r'(0x[0-9a-fA-F]{1,2})\/)|'  # IPv4 in hexadecimal
        r'(?:[a-fA-F0-9]{1,4}:){7}[a-fA-F0-9]{1,4}', url)  # IPv6
    return 1 if match else 0

df['use_of_ip'] = df['url'].apply(lambda i: having_ip_address(i))

# Abnormal URL: check whether the hostname appears in the URL string
def abnormal_url(url):
    hostname = str(urlparse(url).hostname)
    match = re.search(hostname, url)
    return 1 if match else 0

df['abnormal_url'] = df['url'].apply(lambda i: abnormal_url(i))

# pip install googlesearch-python
from googlesearch import search

def google_index(url):
    site = search(url, 5)
    return 1 if site else 0

df['google_index'] = df['url'].apply(lambda i: google_index(i))

# Count of '.' characters
def count_dot(url):
    return url.count('.')

df['count.'] = df['url'].apply(lambda i: count_dot(i))

# Count of 'www' occurrences
def count_www(url):
    return url.count('www')

df['count-www'] = df['url'].apply(lambda i: count_www(i))

# Count of '@' characters
def count_atrate(url):
    return url.count('@')

df['count@'] = df['url'].apply(lambda i: count_atrate(i))

# Number of directories in the URL path
def no_of_dir(url):
    urldir = urlparse(url).path
    return urldir.count('/')

df['count_dir'] = df['url'].apply(lambda i: no_of_dir(i))

# Number of embedded domains ('//' in the path)
def no_of_embed(url):
    urldir = urlparse(url).path
    return urldir.count('//')

df['count_embed_domian'] = df['url'].apply(lambda i: no_of_embed(i))

# Use of a URL shortening service
def shortening_service(url):
    match = re.search(
        r'bit\.ly|goo\.gl|shorte\.st|go2l\.ink|x\.co|ow\.ly|t\.co|tinyurl|tr\.im|is\.gd|cli\.gs|'
        r'yfrog\.com|migre\.me|ff\.im|tiny\.cc|url4\.eu|twit\.ac|su\.pr|twurl\.nl|snipurl\.com|'
        r'short\.to|BudURL\.com|ping\.fm|post\.ly|Just\.as|bkite\.com|snipr\.com|fic\.kr|loopt\.us|'
        r'doiop\.com|short\.ie|kl\.am|wp\.me|rubyurl\.com|om\.ly|to\.ly|bit\.do|t\.co|lnkd\.in|'
        r'db\.tt|qr\.ae|adf\.ly|goo\.gl|bitly\.com|cur\.lv|tinyurl\.com|ow\.ly|bit\.ly|ity\.im|'
        r'q\.gs|is\.gd|po\.st|bc\.vc|twitthis\.com|u\.to|j\.mp|buzurl\.com|cutt\.us|u\.bb|yourls\.org|'
        r'x\.co|prettylinkpro\.com|scrnch\.me|filoops\.info|vzturl\.com|qr\.net|1url\.com|tweez\.me|v\.gd|'
        r'tr\.im|link\.zip\.net',
        url)
    return 1 if match else 0

df['short_url'] = df['url'].apply(lambda i: shortening_service(i))

# Counts of 'https', 'http', '%', '?', '-', and '='
def count_https(url):
    return url.count('https')

df['count-https'] = df['url'].apply(lambda i: count_https(i))

def count_http(url):
    return url.count('http')

df['count-http'] = df['url'].apply(lambda i: count_http(i))

def count_per(url):
    return url.count('%')

df['count%'] = df['url'].apply(lambda i: count_per(i))

def count_ques(url):
    return url.count('?')

df['count?'] = df['url'].apply(lambda i: count_ques(i))

def count_hyphen(url):
    return url.count('-')

df['count-'] = df['url'].apply(lambda i: count_hyphen(i))

def count_equal(url):
    return url.count('=')

df['count='] = df['url'].apply(lambda i: count_equal(i))

# Length of the full URL
def url_length(url):
    return len(str(url))

df['url_length'] = df['url'].apply(lambda i: url_length(i))

# Length of the hostname
def hostname_length(url):
    return len(urlparse(url).netloc)

df['hostname_length'] = df['url'].apply(lambda i: hostname_length(i))

df.head()

# Presence of suspicious words
def suspicious_words(url):
    match = re.search(
        'PayPal|login|signin|bank|account|update|free|lucky|service|bonus|ebayisapi|webscr',
        url)
    return 1 if match else 0

df['sus_url'] = df['url'].apply(lambda i: suspicious_words(i))

# Number of digits
def digit_count(url):
    digits = 0
    for i in url:
        if i.isnumeric():
            digits = digits + 1
    return digits

df['count-digits'] = df['url'].apply(lambda i: digit_count(i))

# Number of letters
def letter_count(url):
    letters = 0
    for i in url:
        if i.isalpha():
            letters = letters + 1
    return letters

df['count-letters'] = df['url'].apply(lambda i: letter_count(i))

# pip install tld
from tld import get_tld

# First directory length
def fd_length(url):
    urlpath = urlparse(url).path
    try:
        return len(urlpath.split('/')[1])
    except:
        return 0

df['fd_length'] = df['url'].apply(lambda i: fd_length(i))

# Length of top-level domain
df['tld'] = df['url'].apply(lambda i: get_tld(i, fail_silently=True))

def tld_length(tld):
    try:
        return len(tld)
    except:
        return -1

df['tld_length'] = df['tld'].apply(lambda i: tld_length(i))
So, after creating the above 22 features, the dataset looks like the
output below. Now, in the next step, we drop the irrelevant columns,
i.e., url, google_index, and tld, as sketched after this explanation.

The reason for dropping the url column is that we have already
extracted the relevant features from it that can be used as input to the
machine learning algorithms. The tld column is dropped because it is an
intermediate textual column, created only for finding the length of the
top-level domain.

The google_index feature denotes whether the URL is indexed in Google
search or not. In this dataset, all the URLs are Google
indexed and have a value of 1.
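
The dropping step itself is not shown in the report; a minimal sketch of it,
assuming the column names created above, would be:

# Drop the raw URL, the constant google_index feature, and the intermediate tld column
df = df.drop(['url', 'google_index', 'tld'], axis=1)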
CHAPTER 8: Exploratory Data Analysis (EDA)
In this step, we will check the distribution of different features across all
four classes of URLs.

As we can observe from the distribution of the use_of_ip feature, only
malware URLs tend to have IP addresses. In the case of abnormal_url,
defacement URLs have a higher distribution.

From the distribution of sus_url, it is clear that benign URLs have the
highest distribution, while phishing URLs have the second highest. Since
the suspicious-word feature consists of transaction- and payment-related
keywords, and genuine banking or payment-related URLs also contain such
keywords, benign URLs end up with the highest distribution.

As per the short_url distribution, we can observe that benign URLs
have the most shortened URLs, as we generally use URL shortening
services to easily share long URLs.
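
The distribution figures themselves are not reproduced here; a minimal sketch of
one way to plot such per-class comparisons with seaborn (using the feature
columns created earlier) might be:

import seaborn as sns
import matplotlib.pyplot as plt

# Bar height is the mean of each binary feature per URL class
for feature in ['use_of_ip', 'abnormal_url', 'sus_url', 'short_url']:
    sns.barplot(x='type', y=feature, data=df)
    plt.title('Distribution of ' + feature + ' by URL type')
    plt.show()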
CHAPTER 9: Label Encoding

After that, an important step is to label encode the target
variable (type) so that it is converted into the numerical categories
0, 1, 2, and 3, as machine learning algorithms only understand a numeric
target variable.

from sklearn.preprocessing import LabelEncoder

lb_make = LabelEncoder()
df["type_code"] = lb_make.fit_transform(df["type"])
Segregating Feature and Target Variables

In the next step, we create the predictor and target variables. Here the
predictor variables are the independent variables, i.e., the lexical features
of the URL, and the target variable is type_code.

#Predictor Variables
# filtering out google_index as it has only 1 value
X = df[['use_of_ip', 'abnormal_url', 'count.', 'count-www', 'count@',
        'count_dir', 'count_embed_domian', 'short_url', 'count-https',
        'count-http', 'count%', 'count?', 'count-', 'count=', 'url_length',
        'hostname_length', 'sus_url', 'fd_length', 'tld_length', 'count-digits',
        'count-letters']]
#Target Variable
y = df['type_code']
CHAPTER10: Training & Test Split

The next step is to split the dataset into train and test sets. We have
split the dataset into 80:20 ratio i.e., 80% of the data was used to train
the machine learning models, and the rest 20% was used to test the
model.

As we know we have an imbalanced dataset. The reason for this is


around 66% of the data has benign
URLs, 5% malware, 14% phishing, and 15% defacement URLs. So
after randomly splitting the dataset into train and test, it may happen
that the distribution of different categories got disturbed which will
highly affect the performance of the machine learning model. So to
maintain the same proportion of the target variable stratification is
needed.

This stratify parameter makes a split so that the proportion of values


in the sample produced will be the same as the proportion of values
provided to the parameter stratify.

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,


test_size=0.2,shuffle=True, random_state=5)
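
To confirm that stratification preserved the class proportions, a quick sanity
check can be added:

# Class proportions should be (approximately) identical across the full set, train, and test
print(y.value_counts(normalize=True))
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))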
So, now we are ready for the most awaited part, which is model building!
CHAPTER 11: Model building
In this step, we will build three tree-based ensemble machine learning
models, i.e., LightGBM, XGBoost, and Random Forest.

The code for building the machine learning models is shared below. Note that
the class names passed to target_names follow the alphabetical order produced
by LabelEncoder (benign, defacement, malware, phishing).

# Random Forest Model
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, max_features='sqrt')
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
print(classification_report(y_test, y_pred_rf,
                            target_names=['benign', 'defacement', 'malware', 'phishing']))
score = accuracy_score(y_test, y_pred_rf)
print("accuracy: %0.3f" % score)

# XGBoost Classifier
xgb_c = xgb.XGBClassifier(n_estimators=100)
xgb_c.fit(X_train, y_train)
y_pred_x = xgb_c.predict(X_test)
print(classification_report(y_test, y_pred_x,
                            target_names=['benign', 'defacement', 'malware', 'phishing']))
score = accuracy_score(y_test, y_pred_x)
print("accuracy: %0.3f" % score)

# LightGBM Classifier
lgb = LGBMClassifier(objective='multiclass', boosting_type='gbdt',
                     n_jobs=5, silent=True, random_state=5)
LGB_C = lgb.fit(X_train, y_train)
y_pred_lgb = LGB_C.predict(X_test)
print(classification_report(y_test, y_pred_lgb,
                            target_names=['benign', 'defacement', 'malware', 'phishing']))
score = accuracy_score(y_test, y_pred_lgb)
print("accuracy: %0.3f" % score)
Model evaluation & comparison
After fitting the model, as shown above, we have made predictions on
the test set. The performance of Light GBM, XGBoost, and Random
Forest are shown below.

From the above result, it is evident that Random Forest shows the
best performance in terms of test accuracy as it attains the highest
accuracy of 96.6% with a higher detection rate for benign,
defacement, phishing, and malware.
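
For a compact side-by-side view of the three accuracies, a short comparison
loop over the predictions computed above can be added:

# Summarize test accuracy for the three fitted models
for name, preds in [('Random Forest', y_pred_rf),
                    ('XGBoost', y_pred_x),
                    ('LightGBM', y_pred_lgb)]:
    print(name, ': %0.3f' % accuracy_score(y_test, preds))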

So, based on the above performance, we have selected Random Forest
as our main model for detecting malicious URLs, and in the next step
we will also plot its feature importance.
CHAPTER 12: Feature Importance
After selecting our model, i.e., Random Forest, we next check its most
highly contributing features. The code for plotting the feature
importance plot is given below:

feat_importances = pd.Series(rf.feature_importances_, index=X_train.columns)
feat_importances.sort_values().plot(kind="barh", figsize=(10, 6))
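
To read the top features off numerically rather than from the plot, one more
line can be added:

# Print the five most important features by impurity-based importance
print(feat_importances.sort_values(ascending=False).head(5))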

From the above plot, we can observe that the top 5 features for
detecting malicious URLs are hostname_length, count_dir, count-www,
fd_length, and url_length.
CHAPTER 13: Model prediction

In this final step, we will predict malicious URLs using our best-performing
model, i.e., Random Forest.

The code for predicting raw URLs using our selected model is given below.
Note that the features must be appended in the same order as the columns
of X used for training, and that the numeric predictions map back to class
names in the alphabetical order learned by LabelEncoder (0 = benign,
1 = defacement, 2 = malware, 3 = phishing).

def main(url):
    # Features must follow the same order as the training columns in X
    status = []
    status.append(having_ip_address(url))
    status.append(abnormal_url(url))
    status.append(count_dot(url))
    status.append(count_www(url))
    status.append(count_atrate(url))
    status.append(no_of_dir(url))
    status.append(no_of_embed(url))
    status.append(shortening_service(url))
    status.append(count_https(url))
    status.append(count_http(url))
    status.append(count_per(url))
    status.append(count_ques(url))
    status.append(count_hyphen(url))
    status.append(count_equal(url))
    status.append(url_length(url))
    status.append(hostname_length(url))
    status.append(suspicious_words(url))
    status.append(fd_length(url))
    tld = get_tld(url, fail_silently=True)
    status.append(tld_length(tld))
    status.append(digit_count(url))
    status.append(letter_count(url))
    return status

# predict function
def get_prediction_from_url(test_url):
    features_test = main(test_url)
    # Due to updates to scikit-learn, we now need a 2D array as a
    # parameter to the predict function.
    features_test = np.array(features_test).reshape((1, -1))
    pred = rf.predict(features_test)
    # Codes follow LabelEncoder's alphabetical ordering of class names
    if int(pred[0]) == 0:
        return "SAFE"
    elif int(pred[0]) == 1:
        return "DEFACEMENT"
    elif int(pred[0]) == 2:
        return "MALWARE"
    elif int(pred[0]) == 3:
        return "PHISHING"

# predicting sample raw URLs
urls = ['titaniumcorporate.co.za', 'en.wikipedia.org/wiki/North_Dakota']
for url in urls:
    print(get_prediction_from_url(url))
CHAPTER 14: Conclusion

In this report, we have demonstrated a machine learning approach to
detect malicious URLs. We created 22 lexical features from raw
URLs and trained three machine learning models: XGBoost, LightGBM,
and Random Forest. We then compared the performance of the three
models and found that Random Forest outperformed the others, attaining
the highest accuracy of 96.6%. By plotting the feature importance of the
Random Forest model, we found that hostname_length, count_dir, count-www,
fd_length, and url_length are the top 5 features for detecting
malicious URLs. Finally, we coded a prediction function for classifying
any raw URL using our selected model, i.e., Random Forest.
CHAPTER 15: References

• "Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition" by Daniel Jurafsky and James H. Martin.
• "Foundations of Statistical Natural Language Processing" by Christopher D. Manning and Hinrich Schütze.
• Towards Data Science (https://towardsdatascience.com): A platform with numerous NLP articles and tutorials.
• Natural Language Processing with Python (https://www.nltk.org/book/): An online book introducing NLP using Python's NLTK library.
• Association for Computational Linguistics (ACL): One of the premier conferences in NLP.
• Conference on Empirical Methods in Natural Language Processing (EMNLP): A leading conference in NLP research.