Analysis of User Behavior Patterns Using Machine Learning Algorithms
Harshitha Y J, Dept. of ISE, B.M.S. College of Engineering, Bangalore, India, [email protected]
P Jayarekha, Dept. of ISE, B.M.S. College of Engineering, Bangalore, India, [email protected]
Abstract— A weblog is a dynamic record of transactions that is updated regularly as visitors use a website. It contains a variety of information, such as IP addresses, status codes, bytes sent, categories, and timestamps. The primary purpose of a weblog is to monitor user behavior and classify users' interests based on different categories and qualities. This study has two main objectives: first, to categorize successful responses, and second, to distinguish between normal and abnormal user behavior. The research process involves several steps. It begins with data collection, where relevant information is gathered from the weblogs. The next step is data pre-processing, which cleans and organizes the data to make it suitable for analysis. Clustering techniques are then used to identify different patterns of user activity, enabling quick assessment and prediction of user behavior. The main focus of the research is user prediction, achieved by analyzing user preferences extracted from the weblogs at various levels. To accomplish this, various machine learning techniques are used. One of the implemented models is the Random Forest Classifier, which predicts user behavior from specific input parameters.

Keywords— Weblog, Machine learning, clustering

979-8-3503-0692-7/23/$31.00 ©2023 IEEE
Authorized licensed use limited to: Georgia State University. Downloaded on May 14,2024 at 19:19:19 UTC from IEEE Xplore. Restrictions apply.

I. INTRODUCTION

In today's digital age, consumers have access to an abundance of online information. However, sifting through this mass of information to find relevant and valuable content has become increasingly challenging. Analysing and modeling web navigation behaviour can be beneficial in understanding user preferences and requests for online services. Web mining, a data mining technique, plays an important role in collecting and analysing significant data from web data. It comprises three primary subfields: web content mining, web structure mining, and web usage mining, each specializing in a different type of data.

Web content mining involves collecting data from different online resources, such as text, images, audio, and videos available on the internet. Web structure mining focuses on studying the link architecture of websites to uncover meaningful patterns and relationships among web pages. Web usage mining, on the other hand, examines online user activities, including the analysis of weblogs, which offer insight into user behaviour. Weblogs record the actions of each website visitor, enabling the prediction of user behaviour.

However, weblogs are typically unstructured, so data pre-processing is required before meaningful analysis can be performed. Data pre-processing transforms the raw weblog data into a processed format that reveals user navigation patterns. This includes steps such as data cleansing to eliminate errors and inconsistencies, user identification to distinguish individual visitors, session identification to group related actions, content retrieval to extract relevant information, and path completion to understand the flow of user interactions on the website.

By leveraging web mining techniques, particularly web usage mining, web content mining, and web structure mining, researchers gain valuable insights into user behaviour, preferences, and interests. Understanding user navigation patterns and behaviour can significantly improve the design of online services, enhance user experience, and provide more relevant and personalized content to users.

There are several benefits to predicting typical or aberrant behaviour in online applications and websites: it can improve security, improve the user experience, and optimise speed.

Web server activity prediction methods include online and offline elements. Online analysis uses the weblog in real time, utilising user intuition lists, whereas offline analysis uses past data, such as log files or downloaded weblogs. Historical weblog analysis of user behaviour reveals trends of web navigation. Web traversal patterns can be broken down into frequent and semi-frequent sequences using classifiers, which can provide information about client preferences.

In our work, we treat this list as data and use it for analysis. We perform a comparative analysis of different machine learning techniques on our dataset to assess how well each model fits the data, based on root mean square error (RMSE) and R2 score. We also implement a Random Forest classifier machine learning model to identify the behavior of the user.

II. REVIEW OF LITERATURE

[1] Despite the ability to collect vast amounts of logging data from each encounter, security experts face challenges in identifying attacks. The system allows a predefined set of activities, such as system calls in any operating system or actions like "Search Item" and "Filter Results" in online stores. Interaction logs can be leveraged to develop automated security models that protect against invasions. Empirical evidence demonstrates that informed modelling effectively captures typical behaviour, enabling the identification of abnormal conduct. The dataset spans 31 days and consists of over 15,000 sessions conducted by 1,400 users, involving nearly 300 different actions. To address this, the authors propose a strategy that employs machine learning techniques, specifically LSTM neural networks, to simulate typical system interaction behaviour. The suggested methodology is evaluated using a dataset that includes interaction logs from a login and security server administrator interface.

[2] The WE dataset is derived from a particular user's anonymous online browsing behaviour using a search engine. It comprises a sequence of activities, including "ActionSearchUser," "ActionDisplayUser," and more. To analyse this dataset, several algorithms were evaluated, such as K-Nearest Neighbour (KNN), Naive Bayes (NB), Support Vector Machine, and K-means clustering. According to the findings, the Naive Bayes algorithm stands out with a prediction accuracy of around 90.4%.

[3] This study centres on classifying user behaviour by utilizing keystroke dynamics for authentication. The behavioural biometrics of users are recorded, and machine learning principles are employed to categorize them. The study involved gathering anonymized data from 94 users to verify their identities. Classification was based on the events of button presses and action timestamps. Among the classifiers used, the SVM with an RBF kernel exhibited the highest classification performance. Additionally, grid search optimization was employed to find the optimal values for the RBF kernel.

[4] A model for a behaviour-based anomaly detection system for Android devices has been developed using machine learning. The goal of this system is to identify malware based on the actions performed by mobile applications. Three machine learning algorithms were deployed in this system: K-Nearest Neighbour (KNN), Naïve Bayes, and a decision tree method. Among these algorithms, KNN demonstrated the highest accuracy in determining mobile application behaviour within the system.

[5] The researchers propose an ensemble hybrid machine learning strategy to identify additive outliers in behaviour patterns based on their spatial-temporal properties. This approach combines Multi-State Long Short-Term Memory and Convolutional Neural Networks for time series anomaly detection. Their experiments found that Multistate LSTM outperforms a single-state basic LSTM model. To evaluate its effectiveness, the model is trained on publicly available datasets for insider threats. The results demonstrate the success of the proposed model with Multistate LSTM in detecting insider threats, achieving Area Under the Curve scores of 0.9042 on the training data and 0.9047 on the test data, indicating its accuracy in identifying anomalous behaviour patterns related to insider threats.

[6] The paper focuses on evaluating user behaviour in a distributed computing environment using machine learning (ML) algorithms. The primary goal is to distinguish closely related user groups that exhibit similar behaviour patterns. The researchers record and store behaviour-related events in a database for analysis. Three ML algorithms, namely K-Nearest Neighbour (KNN), Naive Bayes, and a decision tree method, were employed. The evaluation shows that the decision tree method provides higher accuracy than the other two algorithms, making it better at discriminating between closely related user groups based on their behaviour patterns.

[7] This study introduces a concept called multi-dimensional semantic activity space, where user behaviour features are aggregated and represented as vectors. By analysing action data from log files across different subsystems in specific domains, the researchers identify user behaviour patterns. Experimental results support the efficiency of the proposed approach in detecting variations in typical behavioural characteristics of participants across different domains. These variations include patterns related to resource access, operational tasks, and performance evaluation. Overall, the study demonstrates the potential of this approach to gain meaningful insights into user behaviour across various domains.

[8] In this paper, researchers employ Big Data Analytics and Machine Learning algorithms to predict whether users are legitimate or malicious based on the Application Layer logs generated by their browsing patterns. The system processes real-time data sourced from their internet-launched application, involving over 10 million lines of application-layer logs. Among the machine learning algorithms used, the Random Forest algorithm achieves the highest accuracy in distinguishing between legitimate and malicious users based on the browsing patterns recorded in the Application Layer logs. The findings emphasize the effectiveness of using Big Data Analytics and the Random Forest algorithm to detect potential malicious activities and classify user behaviour.

III. METHODOLOGY

[Fig. 1 block diagram; blocks: Dataset, Preprocessing, Knowledge Base, Clustering, Classifier, Identify User Behavior]
Fig. 1. Comprehensive framework designed for analysing user behaviour.

Figure 1 shows the comprehensive framework for analysing user behaviour patterns on the web. It consists of five phases.
• Weblogs
• Pre-processing
• Knowledge base
• Clustering
• Classifier

The data is gathered from the Kaggle website. The dataset contains web logs, which are records or files that capture and preserve information about the activities and interactions that take place on a website. Web servers automatically produce web logs, also known as "web server logs" or "access logs," as users access and engage with the online pages and resources stored on the server.
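
As a hedged illustration of what a raw access-log record looks like before pre-processing, the sketch below parses one line in the widely used Common Log Format into the kinds of fields mentioned in the abstract (IP address, status code, bytes sent, timestamp). The sample line, the `LOG_PATTERN` expression, and the `parse_log_line` helper are assumptions made for this example; the Kaggle dataset used in this study is already tabular and is not parsed this way.

```python
import re
from datetime import datetime

# Common Log Format: host ident authuser [timestamp] "request" status bytes
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_log_line(line):
    """Return a dict of fields for one access-log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A "-" in the size field means no bytes were sent.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    record["timestamp"] = datetime.strptime(record["timestamp"], "%d/%b/%Y:%H:%M:%S %z")
    return record


# Hypothetical sample line (documentation IP address, made-up request).
sample = '192.0.2.10 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'
record = parse_log_line(sample)
print(record)
```

Lines that do not match the pattern return `None`, which gives a simple hook for the data cleansing step: malformed records can be counted and dropped before user and session identification.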
Pre-processing is applied to the dataset. In machine learning, pre-processing refers to the steps taken to prepare and transform raw data into a format that ML algorithms can use efficiently. Because the effectiveness of machine learning models is greatly influenced by the quality and relevance of the data, pre-processing is a vital component of the overall data preparation process.

A knowledge base is a centralised database or repository that holds structured and organised knowledge or data about a certain field. It is a useful tool for gathering, keeping, and disseminating knowledge within a company or for general accessibility. Businesses, educational institutions, customer support teams, or any other entity that deals with a significant volume of information can develop and use a knowledge base.

Clustering groups related data points based on their intrinsic patterns or similarities. It is an unsupervised learning technique; therefore, no labels or predetermined classes are necessary.

In machine learning, a classifier is a model or algorithm that learns patterns and relationships from labelled training data in order to make predictions or assign class labels to unseen or unlabelled data points. Since it uses supervised learning, a labelled dataset with input features and corresponding target labels is necessary.

In this part, we discuss the methods used in each step of our analysis.

A. COLLECTION OF DATASETS:
The weblog dataset was obtained from the Kaggle website. It contains the following fields.
• inter_api_access_duration(sec)
• api_access_uniqueness
• sequence_length(count)
• vsession_duration(min)
• num_sessions
• num_users
• num_unique_apis
• source
• classification

B. Cleaning the dataset
The data is cleaned and prepared for plotting and model training. First, extra labels such as “_ID” were removed from the dataset so that graphs could be plotted. Using Python, labels such as “Sl.no” were removed before training the model. The dataset was split into a training set (70%) and a testing set (30%).

Root mean square error (RMSE) measures the prediction error of a model, with a lower value indicating a better fit of the model to the data. RMSE is calculated as the square root of the mean squared error between the actual and predicted values.

R-squared (R2) is another metric, used to evaluate how well the model's predictor variables account for the variance in the response variable. This value typically ranges between 0 and 1, with a higher R2 score indicating a better fit of the model. The relationship between the predictor variables and the response variable is established through models, and the process of fitting the model involves determining how effectively it can predict the value of the response variable based on the predictor variables.

Throughout the study, we apply various classification machine learning algorithms to the dataset to assess their performance. By analyzing the results, we aim to identify the most suitable algorithm for our specific dataset and use it to predict whether a behavior is normal or abnormal based on user inputs.

The models applied in this study include:
1. Logistic Regression
2. Random Forest
3. Decision Tree
4. AdaBoost
5. Gradient Boost
6. KNN
7. Voting Classifier
8. Light GBM

RMSE and R2 metrics were used with these models to evaluate how well each model fits the dataset.

IV. RESULT AND ANALYSIS

Algorithms Applied:

A. Logistic Regression
Logistic regression is a statistical method used for binary classification tasks, where the goal is to predict the likelihood of an observation belonging to one of two classes (typically labeled 0 and 1). It uses a logistic function to model the relationship between the input features (independent variables or predictors) and the probability of the binary outcome.

[Figure 2: bar chart titled "Logistic Regression"; y-axis: Value]
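
The 70/30 split and the RMSE, R2 and accuracy metrics described in this section can be sketched as follows for a logistic regression model. This is a minimal illustration using scikit-learn, not the code used in the study: the data comes from `make_classification` as a synthetic stand-in for the cleaned weblog features, and the hyperparameters and random seeds are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned weblog features and 0/1 labels (assumption).
X, y = make_classification(n_samples=1000, n_features=8, n_informative=5, random_state=0)

# 70% training / 30% testing split, as described in the text.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# RMSE: square root of the mean squared error between actual and predicted labels.
rmse = float(np.sqrt(mean_squared_error(y_test, y_pred)))
# R2: 1 - (residual sum of squares / total sum of squares).
r2 = r2_score(y_test, y_pred)
# Accuracy: fraction of test samples whose predicted label matches the actual label.
accuracy = accuracy_score(y_test, y_pred)

print(f"RMSE={rmse:.3f}  R2={r2:.3f}  accuracy={accuracy:.3f}")
```

When the labels are encoded as 0 and 1, each wrong prediction contributes a squared error of exactly 1, so the squared RMSE computed this way equals the misclassification rate (1 minus accuracy).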
In Figure 2, the Logistic Regression model achieves an R-squared value of 0.91, indicating high explanatory power: 91% of the variance is explained. The model also exhibits a low RMSE of 0.14, which reflects accurate predictions, making it a well-performing model.

B. Decision Tree
A decision tree classifier is a decision tree method used for classification tasks. It is a supervised learning method that predicts categorical class labels for examples based on their features. The main aim is to divide the data into the purest possible subsets, that is, to minimize impurity (e.g., Gini impurity or entropy), by selecting the best feature and corresponding threshold at each node. The same process is applied recursively to each resulting node until certain stopping criteria are met, such as reaching a maximum depth, having a minimum number of samples in a node, or all instances in a node belonging to the same class.

[Figure: bar chart titled "Decision Tree Metrics"; y-axis: Value]

A model that explains most of the variance (high R-squared) and makes accurate predictions (low RMSE) has the desirable characteristics of a well-performing model.

[Figure 4: bar chart titled "Random Forest"; x-axis: R-squared, RMSE, Accuracy; y-axis: Value; plotted values 0.99, 0.44, 0.99]
Fig. 4. Graphical representation of metrics of Random Forest

D. AdaBoost
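
As a minimal sketch of the decision tree classifier described in this section, and of AdaBoost as an ensemble of weak learners, the example below trains both with scikit-learn. The synthetic data, the chosen depth and leaf-size limits, and the number of boosting rounds are assumptions for illustration; they are not the settings used in the study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the weblog features and labels (assumption).
X, y = make_classification(n_samples=600, n_features=8, n_informative=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

# Single tree: Gini impurity as the split criterion, with two stopping criteria
# (maximum depth and minimum number of samples per leaf).
tree = DecisionTreeClassifier(criterion="gini", max_depth=4, min_samples_leaf=5, random_state=1)
tree.fit(X_train, y_train)

# AdaBoost: with no base estimator given, scikit-learn boosts depth-1 trees (decision stumps).
boosted = AdaBoostClassifier(n_estimators=50, random_state=1)
boosted.fit(X_train, y_train)

tree_accuracy = tree.score(X_test, y_test)
boosted_accuracy = boosted.score(X_test, y_test)
print(f"decision tree accuracy={tree_accuracy:.3f}  AdaBoost accuracy={boosted_accuracy:.3f}")
```

Limiting `max_depth` and `min_samples_leaf` is one way to keep a single tree from memorising the training set; a tree grown without such limits can reach perfect training accuracy while generalising poorly.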
E. Gradient Boost
Gradient Boosting, like AdaBoost, combines numerous weak learners (often decision trees) to build a powerful and accurate classifier. The essential distinction lies in how it constructs the ensemble of weak learners.

Gradient Boosting constructs the ensemble in a stepwise fashion, with each weak learner trained to correct the errors made by the prior learners. It applies a gradient descent optimization approach to minimize the loss function, which measures the difference between the actual and predicted class labels.

Light GBM is a gradient boosting method designed for classification tasks. Its primary objective is to build a predictive model capable of classifying data examples into multiple groups or classes based on their respective attributes. Light GBM utilizes gradient boosting, which is a widely used ensemble learning technique.

[Figure: bar chart titled "Light GBM"; x-axis: R-squared, RMSE, Accuracy; y-axis: Value; plotted values 0.205, 0.43, 0.8]

[Figure: bar chart of metrics (title not recoverable); x-axis: R-squared, RMSE, Accuracy; y-axis: Value]
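
The stepwise construction of a gradient-boosted ensemble, on which Light GBM also relies, can be illustrated with the following hedged sketch. It uses scikit-learn's `GradientBoostingClassifier` as a stand-in rather than the Light GBM library, on synthetic data; the number of stages, learning rate and tree depth are assumptions chosen only for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the weblog features and labels (assumption).
X, y = make_classification(n_samples=600, n_features=8, n_informative=5, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=2)

# Each of the 100 stages fits a shallow tree to the gradient of the loss
# left by the stages before it.
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=2, random_state=2)
gb.fit(X_train, y_train)

# train_score_ holds the training loss recorded at each boosting stage.
first_loss = gb.train_score_[0]
final_loss = gb.train_score_[-1]

# staged_predict yields the test-set predictions after each boosting stage.
staged_accuracy = [accuracy_score(y_test, pred) for pred in gb.staged_predict(X_test)]

print(f"training loss: first stage={first_loss:.3f}, final stage={final_loss:.3f}")
print(f"test accuracy: first stage={staged_accuracy[0]:.3f}, final stage={staged_accuracy[-1]:.3f}")
```

Comparing the first and last entries shows the effect of adding learners one at a time: the training loss falls as later trees correct the errors of earlier ones.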
H. Voting Classifier
The Voting Classifier combines predictions from multiple individual classifiers, also known as base or component classifiers, to make the final prediction. The primary concept behind the Voting Classifier is to harness the collective knowledge of multiple classifiers, utilizing their respective strengths and compensating for their individual weaknesses. This approach often leads to improved overall performance and more robust predictions compared with using a single classifier.

The R2 score of 0.20 suggests that the Voting Classifier model is not capturing the underlying relationships in the data well. It is underperforming compared to what is expected from a good model.

[Figure 9: bar chart titled "Voting Classifier"; x-axis: R-squared, RMSE, Accuracy; y-axis: Value; plotted values 0.205, 0.43, 0.8]
Fig. 9. Graphical representation of metrics of Voting Classifier

I. Comparison of metrics of all Algorithms

TABLE 1. COMPARISON TABLE FOR METRICS OF ALGORITHMS

Algorithms            R2 Score   RMSE   Accuracy
Logistic Regression   0.91       0.14   0.97
Random Forest         0.99       0.44   0.99
Decision Tree         1          0      1
Voting Classifier     0.205      0.43   0.8
AdaBoost              0.86       0.18   0.96
Gradient Boost        0.69       0.27   0.92
Light GBM             0.205      0.43   0.8
KNN                   0.82       0.2    0.95

[Figure 10: horizontal bar chart titled "Comparison"; one group of bars per algorithm; series: R2 Score, RMSE, Accuracy]
Fig. 10. Graphical representation of metrics of algorithms

As shown in Table 1 and Fig. 10, Random Forest, Logistic Regression and Decision Tree are the best-fit models for our dataset, since they have a high R2 score and low RMSE value. When a model has both a high R-squared score and a low RMSE value, it suggests that the model is a good fit to the data, explains a significant proportion of the variance in the dependent variable, and provides accurate predictions.

We have implemented a Random Forest classifier machine learning model that predicts, from a set of input parameters, whether the behavior of the user is malicious or normal. A function is defined for this purpose. The function starts by collecting input values from the user. Each input corresponds to a specific feature for the prediction. The code then reads a dataset from a CSV file and separates the features (X) and the target variable (y). Next, it splits the data into training and testing sets. The training set is used for training the model.

The user input values are converted into a list named x_text and then transformed into an array to match the shape expected by the classifier.

Based on the predicted result, a message (msg) is assigned either "Normal Behavior" or "Abnormal Behavior".

Fig. 11. Normal Behavior Prediction
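
The prediction function described above can be sketched roughly as follows. This is an illustrative reconstruction under stated assumptions, not the authors' code: the function name `predict_behavior`, the two feature columns, the label values "normal" and "outlier", and the synthetic data are all hypothetical. To keep the sketch runnable without a file or interactive prompts, it takes a DataFrame and a list of values as arguments, where the described function reads a CSV file and collects the values from the user.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def predict_behavior(data, input_values, target="classification"):
    """Train a Random Forest on `data` and classify one set of user-supplied feature values."""
    X = data.drop(columns=[target])  # features
    y = data[target]                 # target variable
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_train.to_numpy(), y_train)

    # Convert the input values to a list, then to a (1, n_features) array,
    # which is the shape the classifier expects for a single sample.
    x_text = list(input_values)
    x_array = np.array(x_text, dtype=float).reshape(1, -1)

    prediction = clf.predict(x_array)[0]
    msg = "Normal Behavior" if prediction == "normal" else "Abnormal Behavior"
    return msg


# Hypothetical, well-separated synthetic data standing in for the weblog CSV.
rng = np.random.default_rng(0)
normal = pd.DataFrame({
    "sequence_length(count)": rng.normal(10, 2, 100),
    "num_unique_apis": rng.normal(5, 1, 100),
    "classification": "normal",
})
abnormal = pd.DataFrame({
    "sequence_length(count)": rng.normal(100, 10, 100),
    "num_unique_apis": rng.normal(50, 5, 100),
    "classification": "outlier",
})
data = pd.concat([normal, abnormal], ignore_index=True)

print(predict_behavior(data, [11, 5]))
print(predict_behavior(data, [105, 48]))
```

Retraining the model on every call keeps the sketch short; in practice the classifier would more likely be trained once and reused for each new set of inputs.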
Fig. 12. Abnormal Behavior Prediction

As shown in Fig. 11 and Fig. 12, based on the input weblog values from the user, the Random Forest classifier model predicts whether the user behavior is normal or malicious.

V. CONCLUSION

In summary, the dataset used in this study was sourced from the Kaggle website. The data underwent preparation and preprocessing, and various ML algorithms were applied to generate graphs and calculate R-squared scores and RMSE for comparison. The use of machine learning techniques allowed for accurate analysis of the dataset. Additionally, the Random Forest classifier model enabled the prediction of whether a user's behavior is normal or abnormal.

REFERENCES

[1] L. Adilova, L. Natious, S. Chen, O. Thonnard and M. Kamp, "System Misuse Detection Via Informed Behavior Clustering and Modeling," 2019 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W), Portland, OR, USA, 2019, pp. 15-23, doi: 10.1109/DSN-W.2019.00011.
[2] Ashwini and K. Viswavardhan Reddy, "Predicting the User Behavior Analysis using Machine Learning Algorithms," International Research Journal of Engineering and Technology.
[3] S. Krishnamoorthy, L. Rueda, S. Saad and H. Elmiligi, "Identification of User Behavioral Biometrics for Authentication using Keystroke Dynamics and Machine Learning," 2018.
[4] S. Vanjire and M. Lakshmi, "Behavior-Based Malware Detection System Approach For Mobile Security Using Machine Learning," 2021 International Conference on Artificial Intelligence and Machine Vision (AIMV), Gandhinagar, India, 2021, pp. 1-4, doi: 10.1109/AIMV53313.2021.9671009.
[5] M. Singh, B. M. Mehtre and S. Sangeetha, "User Behavior Profiling using Ensemble Approach for Insider Threat," Jan. 2019.
[6] M. Callara and P. Wira, "User Behavior Analysis with Machine Learning Techniques in Cloud Computing Architecture," Nov. 2018.
[7] Y. Tao, S. Guo, C. Shi and D. Chu, "User Behavior Analysis by Cross-Domain Log Data Fusion," in IEEE Access, vol. 8, pp. 400-406, 2020, doi: 10.1109/ACCESS.2019.2961769.
[8] R. Ranjan and S. S. Kumar, "User behavior analysis using data analytics and machine learning to predict malicious user versus legitimate," vol. 2, no. 1, Mar. 2022, Art. no. 100034.
[9] D. F. Galletta, R. Henry, S. McCoy and P. Polak, "When the Wait Isn't So Bad: The Interacting Effects of Website Delay, Familiarity, and Breadth," Information Systems Research, vol. 17, no. 1, pp. 20-37, 2006.
[10] J. Palmer, "Web Site Usability, Design, and Performance Metrics," Information Systems Research, vol. 13, no. 2, pp. 151-167, 2002.
[11] Y. Chen and W. Liu, "The Sentiment Attitude of Weibo Users towards Annual Individual Income Tax Return: Based on Natural Language Processing and Machine Learning Methods," 2023 IEEE 6th International Conference on Big Data and Artificial Intelligence (BDAI), Jiaxing, China, 2023, pp. 67-72, doi: 10.1109/BDAI59165.2023.10256913.
[12] A. Saleem Raja, B. Sundarvadivazhagan, R. Vijayarangan and S. Veeramani, "Malicious Webpage Classification Based on Web Content Features using Machine Learning and Deep Learning," 2022 International Conference on Green Energy, Computing and Sustainable Technology (GECOST), Miri Sarawak, Malaysia, 2022, pp. 314-319, doi: 10.1109/GECOST55694.2022.10010386.
[13] S. Dong, Y. Xia and T. Peng, "Network Abnormal Traffic Detection Model Based on Semi-Supervised Deep Reinforcement Learning," in IEEE Transactions on Network and Service Management, vol. 18, no. 4, pp. 4197-4212, Dec. 2021, doi: 10.1109/TNSM.2021.3120804.
[14] M. S. Ashraf, F. Rehman, H. Sharif, M. Aqeel, M. Arslan and A. Rida, "Spam Consumer's Reviews Detection for E-Commerce Website using Linguistic Approach in Deep Learning," 2022 3rd International Conference on Innovations in Computer Science & Software Engineering (ICONICS), Karachi, Pakistan, 2022, pp. 1-7, doi: 10.1109/ICONICS56716.2022.10100351.
[15] C. H. Sumanth, P. P. Kalyan, B. Ravi and S. Balasubramani, "Analysis of Credit Card Fraud Detection using Machine Learning Techniques," 2022 7th International Conference on Communication and Electronics Systems (ICCES), Coimbatore, India, 2022, pp. 1140-1144, doi: 10.1109/ICCES54183.2022.9835751.