NLP - PBL - Project Report - Draft.02
Group Members:
ARYAN SAHANI
AMEYA CHHATRE
KAPIL MANKOSKAR
With the increasing prevalence of online threats and cyber-attacks, there is a pressing need
for effective methods to detect and mitigate malicious links shared on the internet. This
report presents a novel approach to address this challenge by leveraging the power of
machine learning and natural language processing (NLP) techniques.
The proposed malicious link detector utilizes a combination of supervised learning
algorithms and advanced NLP models to analyze the textual content and context of links,
aiming to classify them as either malicious or benign. The system operates in real-time,
providing prompt identification and mitigation of potential security risks.
The detection process involves several key stages. Initially, a comprehensive dataset of
labeled examples is collected, consisting of both malicious and benign links. Features are
extracted from these examples, capturing relevant information such as URL structure, domain
reputation, and content characteristics.
Next, a supervised learning algorithm, such as a support vector machine (SVM) or a random
forest classifier, is trained on the labeled dataset. These algorithms learn to differentiate
between malicious and benign links based on the extracted features. To enhance the system's
performance, state-of-the-art NLP models, such as transformer-based architectures like BERT
or GPT, are incorporated to capture semantic relationships and contextual information present
in the link's text.
During the inference phase, the trained model applies the learned patterns and rules to unseen
links, assessing their maliciousness probability. By examining the textual content and
leveraging NLP techniques, the system can identify suspicious patterns, phishing attempts,
malware propagation, or other forms of malicious intent.
To evaluate the effectiveness of the proposed malicious link detector, extensive experiments
are conducted using diverse datasets comprising both known malicious links and legitimate
URLs. The system's performance is measured in terms of metrics such as accuracy, precision,
recall, and F1 score.
The results demonstrate the efficacy of the machine learning and NLP-based approach in
accurately identifying malicious links, achieving high detection rates against potential
threats in real time while keeping false positives to a minimum. The proposed system holds
significant promise in strengthening cybersecurity defenses, assisting internet users, and
safeguarding online activity.
Overall, this report outlines a novel approach for detecting malicious links by harnessing
the power of machine learning and NLP techniques. The integration of supervised learning
algorithms and advanced NLP models provides an effective means of identifying and
mitigating security risks, ultimately contributing to enhanced online safety and protecting
users from cyber threats.
Contents
• Introduction
• URL
• Malicious URL
• Problem Statement
• Project Flow
• Dataset Description
• Wordclouds
• Libraries
• Loading Dataset
• Feature Engineering
• EDA
• Label Encoding
• Segregating Features & Target Variable
• Training & Testing
• Model Building
• Feature Importance
• Model Prediction
• Conclusion
• References
CHAPTER 1: Introduction
A URL (Uniform Resource Locator) is composed of the following parts:
i) Protocol: It specifies the scheme (e.g., http or https) used to access the resource
ii) Domain: It specifies the hostname of the server on which the resource is hosted
iii) Path: It specifies the actual path where the resource is located
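For illustration, Python's urlparse (used throughout the feature-engineering code later) can split a sample URL into these components; the example URL here is hypothetical:
from urllib.parse import urlparse

# Illustrative example: decompose a sample URL into its parts
parts = urlparse('https://www.example.com/path/to/page?query=1')
print(parts.scheme)   # protocol: 'https'
print(parts.netloc)   # domain:   'www.example.com'
print(parts.path)     # path:     '/path/to/page'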
Problem statement
Project flow
Dataset description
Now, let's discuss the different types of URLs in our dataset, i.e., Benign,
Malware, Phishing, and Defacement URLs.
import os
import itertools

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import xgboost as xgb
from lightgbm import LGBMClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split
from wordcloud import WordCloud
Next, we will load the dataset and check a few sample records to get an
understanding of the data.
CHAPTER 6: Loading Dataset
In this step, we will import the dataset using the pandas library and
check the sample entries in the dataset.
df = pd.read_csv('malicious_phish.csv')
print(df.shape)
df.head()
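To get a quick visual sense of which tokens dominate a given class of URLs, a word cloud can be drawn. A minimal sketch for the phishing class follows; the lowercase class label 'phishing' in the type column is an assumption about the dataset:
# Sketch: word cloud of phishing URLs (assumes the 'type' column uses
# the lowercase label 'phishing')
phish_urls = " ".join(df[df['type'] == 'phishing']['url'])
wc = WordCloud(width=800, height=400, background_color='white').generate(phish_urls)
plt.figure(figsize=(10, 5))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()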
import re

# Flags URLs that use a raw IP address (decimal, hexadecimal, or IPv6)
# in place of a domain name
def having_ip_address(url):
    match = re.search(
        '(([01]?\\d\\d?|2[0-4]\\d|25[0-5])\\.([01]?\\d\\d?|2[0-4]\\d|25[0-5])\\.'
        '([01]?\\d\\d?|2[0-4]\\d|25[0-5])\\.([01]?\\d\\d?|2[0-4]\\d|25[0-5])\\/)|'  # IPv4
        '((0x[0-9a-fA-F]{1,2})\\.(0x[0-9a-fA-F]{1,2})\\.(0x[0-9a-fA-F]{1,2})\\.'
        '(0x[0-9a-fA-F]{1,2})\\/)|'  # IPv4 in hexadecimal
        '(?:[a-fA-F0-9]{1,4}:){7}[a-fA-F0-9]{1,4}', url)  # IPv6
    if match:
        return 1
    else:
        return 0

df['use_of_ip'] = df['url'].apply(lambda i: having_ip_address(i))
from urllib.parse import urlparse

# Returns 1 when the parsed hostname appears verbatim in the URL string
def abnormal_url(url):
    hostname = str(urlparse(url).hostname)
    # re.escape keeps regex metacharacters in the hostname from breaking
    # the search
    match = re.search(re.escape(hostname), url)
    if match:
        return 1
    else:
        return 0

df['abnormal_url'] = df['url'].apply(lambda i: abnormal_url(i))
# pip install googlesearch-python
from googlesearch import search

# Checks whether Google returns any results for the URL.
# Note: this performs a live web search per URL, so it is slow on
# large datasets.
def google_index(url):
    # search() returns a generator, which is always truthy; materialize
    # it so empty result sets are detected correctly
    site = list(search(url, 5))
    return 1 if site else 0

df['google_index'] = df['url'].apply(lambda i: google_index(i))
def count_dot(url):
    return url.count('.')

df['count.'] = df['url'].apply(lambda i: count_dot(i))

def count_www(url):
    return url.count('www')

df['count-www'] = df['url'].apply(lambda i: count_www(i))

def count_atrate(url):
    return url.count('@')

df['count@'] = df['url'].apply(lambda i: count_atrate(i))

def no_of_dir(url):
    urldir = urlparse(url).path
    return urldir.count('/')

df['count_dir'] = df['url'].apply(lambda i: no_of_dir(i))

def no_of_embed(url):
    urldir = urlparse(url).path
    return urldir.count('//')

df['count_embed_domian'] = df['url'].apply(lambda i: no_of_embed(i))
# Flags URLs that use a known link-shortening service
def shortening_service(url):
    match = re.search(
        r'bit\.ly|goo\.gl|shorte\.st|go2l\.ink|x\.co|ow\.ly|t\.co|tinyurl|tr\.im|is\.gd|cli\.gs|'
        r'yfrog\.com|migre\.me|ff\.im|tiny\.cc|url4\.eu|twit\.ac|su\.pr|twurl\.nl|snipurl\.com|'
        r'short\.to|BudURL\.com|ping\.fm|post\.ly|Just\.as|bkite\.com|snipr\.com|fic\.kr|loopt\.us|'
        r'doiop\.com|short\.ie|kl\.am|wp\.me|rubyurl\.com|om\.ly|to\.ly|bit\.do|t\.co|lnkd\.in|'
        r'db\.tt|qr\.ae|adf\.ly|goo\.gl|bitly\.com|cur\.lv|tinyurl\.com|ow\.ly|bit\.ly|ity\.im|'
        r'q\.gs|is\.gd|po\.st|bc\.vc|twitthis\.com|u\.to|j\.mp|buzurl\.com|cutt\.us|u\.bb|yourls\.org|'
        r'x\.co|prettylinkpro\.com|scrnch\.me|filoops\.info|vzturl\.com|qr\.net|1url\.com|tweez\.me|v\.gd|'
        r'tr\.im|link\.zip\.net',
        url)
    if match:
        return 1
    else:
        return 0

df['short_url'] = df['url'].apply(lambda i: shortening_service(i))
def count_https(url):
    return url.count('https')

df['count-https'] = df['url'].apply(lambda i: count_https(i))

# Note: 'http' also matches the prefix of 'https'
def count_http(url):
    return url.count('http')

df['count-http'] = df['url'].apply(lambda i: count_http(i))

def count_per(url):
    return url.count('%')

df['count%'] = df['url'].apply(lambda i: count_per(i))

def count_ques(url):
    return url.count('?')

df['count?'] = df['url'].apply(lambda i: count_ques(i))

def count_hyphen(url):
    return url.count('-')

df['count-'] = df['url'].apply(lambda i: count_hyphen(i))

def count_equal(url):
    return url.count('=')

df['count='] = df['url'].apply(lambda i: count_equal(i))

# Length of URL
def url_length(url):
    return len(str(url))

df['url_length'] = df['url'].apply(lambda i: url_length(i))

# Hostname length
def hostname_length(url):
    return len(urlparse(url).netloc)

df['hostname_length'] = df['url'].apply(lambda i: hostname_length(i))
df.head()
# Flags URLs containing words commonly seen in phishing attempts
def suspicious_words(url):
    match = re.search(
        'PayPal|login|signin|bank|account|update|free|lucky|service|'
        'bonus|ebayisapi|webscr',
        url)
    if match:
        return 1
    else:
        return 0

df['sus_url'] = df['url'].apply(lambda i: suspicious_words(i))
def digit_count(url):
    digits = 0
    for i in url:
        if i.isnumeric():
            digits = digits + 1
    return digits

df['count-digits'] = df['url'].apply(lambda i: digit_count(i))

def letter_count(url):
    letters = 0
    for i in url:
        if i.isalpha():
            letters = letters + 1
    return letters

df['count-letters'] = df['url'].apply(lambda i: letter_count(i))
# pip install tld
from tld import get_tld

# First directory length
def fd_length(url):
    urlpath = urlparse(url).path
    try:
        return len(urlpath.split('/')[1])
    except IndexError:
        return 0

df['fd_length'] = df['url'].apply(lambda i: fd_length(i))

# Length of top-level domain
df['tld'] = df['url'].apply(lambda i: get_tld(i, fail_silently=True))

def tld_length(tld):
    try:
        return len(tld)
    except TypeError:
        # get_tld returns None when no TLD is found
        return -1

df['tld_length'] = df['tld'].apply(lambda i: tld_length(i))
After creating the above 22 features, the dataset looks as shown below.
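The expanded feature set can be inspected with, for example:
# Inspect the engineered feature columns
df.head()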
Now, in the next step, we drop the irrelevant columns,
i.e., url, google_index, and tld.
The url column is dropped because we have already extracted the
relevant features from it that serve as inputs to the machine learning
algorithms. The google_index column is dropped because it contains
only a single value. The tld column is dropped because it is an
intermediate textual column: it was created only so that the length of
the top-level domain (tld_length) could be computed.
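A minimal sketch of this step might look like:
# Drop columns that are no longer needed as model inputs
df = df.drop(['url', 'google_index', 'tld'], axis=1)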
After that, the most important step is to label-encode the target
variable (type) so that it is converted into the numerical categories
0, 1, 2, and 3, since machine learning algorithms require a numeric
target variable.
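A minimal sketch of the encoding, using an explicit mapping chosen to be consistent with the class labels returned in the prediction step later (the lowercase class name strings are assumptions about the dataset):
# Encode the target classes as numeric codes (mapping assumed:
# 0=benign/safe, 1=defacement, 2=phishing, 3=malware)
df['type_code'] = df['type'].replace({'benign': 0, 'defacement': 1,
                                      'phishing': 2, 'malware': 3})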
So, in the next step, we create the predictor and target variables.
Here, the predictor variables are the independent variables, i.e., the
features extracted from the URL, and the target variable is the
encoded type_code.
# Predictor variables
# (google_index is filtered out as it has only one value)
X = df[['use_of_ip', 'abnormal_url', 'count.', 'count-www', 'count@',
        'count_dir', 'count_embed_domian', 'short_url', 'count-https',
        'count-http', 'count%', 'count?', 'count-', 'count=', 'url_length',
        'hostname_length', 'sus_url', 'fd_length', 'tld_length',
        'count-digits', 'count-letters']]

# Target variable
y = df['type_code']
CHAPTER 10: Training & Test Split
The next step is to split the dataset into training and test sets. We
split the dataset in an 80:20 ratio, i.e., 80% of the data was used to
train the machine learning models and the remaining 20% was used to
test them.
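A minimal sketch of the split (the random_state value is an assumption, chosen only for reproducibility):
# 80:20 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=5)

The model-building code is summarized here as a sketch for the best-performing model referenced below, a Random Forest named rf; the hyperparameters are assumptions:
from sklearn.ensemble import RandomForestClassifier

# Train a Random Forest classifier (hyperparameters assumed)
rf = RandomForestClassifier(n_estimators=100, random_state=5)
rf.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = rf.predict(X_test)
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred,
      target_names=['benign', 'defacement', 'phishing', 'malware']))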
From the above results, it is evident that Random Forest shows the
best performance in terms of test accuracy, attaining the highest
accuracy of 96.6% with a high detection rate for benign, defacement,
phishing, and malware URLs.
feat_importances = pd.Series(rf.feature_importances_, index=X_train.columns)
feat_importances.sort_values().plot(kind="barh", figsize=(10, 6))
From the above plot, we can observe that the top 5 features for
detecting malicious URLs are hostname_length, count_dir, count-www,
fd_length, and url_length.
CHAPTER 13: Model Prediction
In this final step, we will predict malicious URLs using our
best-performing model, i.e., Random Forest.
The code for predicting raw URLs with the saved model is given
below:
# Build the feature vector for a raw URL; the append order must match
# the training columns in X above
def main(url):
    status = []
    status.append(having_ip_address(url))
    status.append(abnormal_url(url))
    status.append(count_dot(url))
    status.append(count_www(url))
    status.append(count_atrate(url))
    status.append(no_of_dir(url))
    status.append(no_of_embed(url))
    status.append(shortening_service(url))
    status.append(count_https(url))
    status.append(count_http(url))
    status.append(count_per(url))
    status.append(count_ques(url))
    status.append(count_hyphen(url))
    status.append(count_equal(url))
    status.append(url_length(url))
    status.append(hostname_length(url))
    status.append(suspicious_words(url))
    status.append(fd_length(url))
    tld = get_tld(url, fail_silently=True)
    status.append(tld_length(tld))
    status.append(digit_count(url))
    status.append(letter_count(url))
    return status
# Predict function: classify a raw URL with the trained Random Forest
def get_prediction_from_url(test_url):
    features_test = main(test_url)
    # scikit-learn's predict() expects a 2D array
    features_test = np.array(features_test).reshape((1, -1))
    pred = rf.predict(features_test)
    if int(pred[0]) == 0:
        return "SAFE"
    elif int(pred[0]) == 1:
        return "DEFACEMENT"
    elif int(pred[0]) == 2:
        return "PHISHING"
    elif int(pred[0]) == 3:
        return "MALWARE"
# Predicting sample raw URLs
urls = ['titaniumcorporate.co.za', 'en.wikipedia.org/wiki/North_Dakota']
for url in urls:
    print(get_prediction_from_url(url))
CHAPTER 14: Conclusion