BE Project
Department of Computer Engineering
Under the Supervision of: Submitted by:
• Machine learning models build baselines of normal behavior for each user and host by looking at
historical activity and comparisons within peer groups.
• User Behavior Analytics is a powerful approach to detecting threats inside an organization and
empowering analysts with new threat hunting capabilities.
Problem Definition
1. User and Entity Behaviour Analytics for Enterprise Security (2016, IEEE)
• UEBA modules track and monitor the behaviours of users, IP addresses and devices in an enterprise.
• Anomalous behaviour is automatically detected using machine learning algorithms based on Singular Value Decomposition (SVD) and Mahalanobis distance.
2. Role of User and Entity Behavior Analytics in Detecting Insider Attacks (2020, IEEE)
• The first alert level is based on policy violations and pre-recognized attacks.
• The second alert level is based on threshold anomalies.
• The third alert level is based on deviation-based anomalies calculated using Mahalanobis distance, standard deviation and the covariance matrix.
3. Secure because Math: A deep-dive on Machine Learning-based Monitoring (2016, IEEE)
• The author reviews the research in MLsec (network monitoring, SIEM and log management).
• Capabilities in MLsec (classification, anomaly detection).
4. Automated Insider Threat Detection System Using User and Role-Based Profile Assessment (2017, IEEE)
• Tree-structure profiling approach for every user and every activity.
• One alert stage for policy violations or previously known attacks, and a second alert stage for threshold-based and deviation-based anomalies.
5. Measuring the Effectiveness of User and Entity Behaviour Analytics for the Prevention of Insider Threats (2021, Journal of Xi'an University of Architecture & Technology)
• UEBA sets up a baseline for each entity for evaluation.
• An SVD-based algorithm is used for model building.
6. Insider Threat Detection using Deep Learning: A Review (2020, IEEE)
• Explains why UEBA is necessary for an organisation.
• Different Deep Learning architectures have been discussed for
8. Detecting Insider Threats Using RADISH: A System for Real-Time Anomaly Detection in Heterogeneous Data Streams (2017, IEEE)
• Uses the RADISH system for real-time anomaly detection.
• Considers log files, logon/logoff events, and the LDAP directory as its data set.
• Uses the KNN algorithm for detection.
Gap Identification
• UEBA tools are effective when behavior changes rapidly, but they perform poorly when
changes happen slowly.
• The deep learning MBS model achieves higher accuracy than the other models.
Project Scope
Methodology
❑ The following methodology and algorithms are used in this work:
1. MBS Design
2. Feature Extraction
3. convLSTM Algorithm
4. MLP
1. MBS Design
❑ We propose a Multimodel-Based System (MBS) for anomaly detection of user
behaviors.
❑ The MBS comprises three primary models: a model for action features, a model
for action sequences, and a model for role features.
❑ The design assumes that users who are in the same group perform the same
jobs.
❑ Role features can be extracted from these groups through their daily jobs,
behaviors, and other data, which helps to justify a detection to some extent.
❑ Specifically, in the MBS, we analyze data to characterize users' daily activities and
habits and determine whether a user is performing a threatening operation from three
perspectives:
1. Feature deviation is defined by the deviations between features to be detected
and features predicted.
2. Sequence deviation is the deviation between the action sequence to be detected
and the normal action sequence predicted.
3. Role deviation describes the degree of deviation between role features to be
detected and the role features calculated based on all features of users that are in
the same group.
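The three perspectives above can be sketched as three small functions. This is a minimal illustration, not the project's implementation: the source does not state which distance measures are used, so Euclidean distance for feature and role deviation, the mismatch fraction for sequence deviation, and the toy values are all assumptions.

```python
import math

def euclidean_deviation(observed, reference):
    """Euclidean distance between two equal-length feature vectors (assumed measure)."""
    if len(observed) != len(reference):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((o - r) ** 2 for o, r in zip(observed, reference)))

def sequence_deviation(observed_actions, predicted_actions):
    """Fraction of positions where the observed action differs from the predicted one."""
    if len(observed_actions) != len(predicted_actions):
        raise ValueError("sequences must have the same length")
    if not observed_actions:
        return 0.0
    mismatches = sum(o != p for o, p in zip(observed_actions, predicted_actions))
    return mismatches / len(observed_actions)

def group_mean(vectors):
    """Element-wise mean of the feature vectors of all users in one group."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Hypothetical toy values for one user and one day.
observed_features = [3.0, 1.0, 0.0]
predicted_features = [3.0, 1.0, 4.0]
feature_dev = euclidean_deviation(observed_features, predicted_features)

observed_seq = ["log on", "web", "email", "log out"]
predicted_seq = ["log on", "web", "web", "log out"]
sequence_dev = sequence_deviation(observed_seq, predicted_seq)

# Role deviation compares the user with the mean of the users in the same group.
group_features = [[1.0, 1.0, 0.0], [3.0, 1.0, 0.0], [2.0, 1.0, 0.0]]
role_dev = euclidean_deviation(observed_features, group_mean(group_features))

print(feature_dev, sequence_dev, role_dev)
```

Each function returns one number per user and time period, so the three deviations can be passed on together to a later combining model.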
2. Feature Extraction
❑ The goal of feature extraction is to identify relevant information in log files and convert it
into a normalized representation.
❑ Suitable features play a significant role in modeling a user’s normal behavior and in capturing
deviations that indicate abnormal behavior and describe the threat profiles of
suspicious users.
❑ Hence, it is useful to identify the different actions a user can take, based on how insiders act,
and to use these actions as features for modeling behavior.
❑ To generate a feature vector, the user’s activities, such as device login/logout, HTTP, file, and email
events, are aggregated. Sessions are calculated for each of these event types. Session calculation
helps find the time a user is active, so it takes into account device activity on weekdays and
weekends, HTTP activity, email activity, and file activity. The calculated sessions provide a
baseline for each individual user’s activity.
Session Calculation
1. Sort the data by user and timestamp in ascending order.
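Only the first step of the session calculation is given here. The sketch below shows that sort and then one possible way to derive sessions from it. The pairing rule (each logon is closed by the same user's next logoff), the `Logon`/`Logoff` labels, and the sample events are assumptions for illustration.

```python
from datetime import datetime

def compute_sessions(events):
    """Return (user, logon_time, logoff_time, duration_seconds) tuples.

    `events` holds (user, timestamp, activity) tuples, where activity is
    "Logon" or "Logoff" (hypothetical labels).
    """
    # Step 1 from the text: sort by user, then by timestamp, ascending.
    ordered = sorted(events, key=lambda e: (e[0], e[1]))
    sessions = []
    open_logon = {}  # user -> timestamp of a logon that has not been closed yet
    for user, timestamp, activity in ordered:
        if activity == "Logon":
            open_logon[user] = timestamp
        elif activity == "Logoff" and user in open_logon:
            start = open_logon.pop(user)
            duration = (timestamp - start).total_seconds()
            sessions.append((user, start, timestamp, duration))
    return sessions

# Hypothetical unsorted events for two users.
events = [
    ("U2", datetime(2010, 1, 4, 9, 30), "Logon"),
    ("U1", datetime(2010, 1, 4, 17, 0), "Logoff"),
    ("U1", datetime(2010, 1, 4, 8, 0), "Logon"),
    ("U2", datetime(2010, 1, 4, 12, 0), "Logoff"),
]
sessions = compute_sessions(events)
for user, start, end, seconds in sessions:
    print(user, start, end, seconds / 3600, "hours")
```

Summing the durations per user and day gives the active time that a baseline can be built from.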
2. Feature Extraction
Action Features:
3. Daily activities may differ from user to user because users work in
different sectors. Hence, we analyze users’ behaviors
based on roles.
Action sequences:
1. Action sequences are sequential data that summarize the sequence of
behaviors for each time period.
2. We first collect every record of a user's activities over a period, and
then sort the records by timestamp to obtain a time-ordered
sequence of actions.
3. For instance, a user logs onto a computer first, then browses
three web pages, uses a thumb drive, sends an email, and finally logs out.
The action sequence is {log on, web, web, web, drive connect, drive
disconnect, email, log out}.
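The example above can be reproduced with a short sketch. The record layout (timestamp, action) and the minute-of-day timestamps are assumptions made for illustration. The integer encoding at the end is one common way to prepare such a sequence for a model and is not taken from the source.

```python
def build_action_sequence(records):
    """Sort one user's (timestamp, action) records by time and return the actions."""
    return [action for _, action in sorted(records, key=lambda r: r[0])]

# Hypothetical records for one user in one period; timestamps are minutes since midnight.
records = [
    (545, "web"), (540, "log on"), (550, "web"), (555, "web"),
    (600, "drive connect"), (610, "drive disconnect"),
    (620, "email"), (1020, "log out"),
]
sequence = build_action_sequence(records)

# Map each distinct action to an integer id, in order of first appearance.
vocabulary = {action: index for index, action in enumerate(dict.fromkeys(sequence))}
encoded = [vocabulary[action] for action in sequence]

print(sequence)
print(encoded)
```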
Role features :
3. LSTM Algorithm
- LSTM is a supervised learning method.
- Its ability to predict the next state from the current state sequence has made it a widely used
technique for sequence regression problems.
- Recurrent Neural Networks (RNNs) analyze sequences using several layers of artificial neural
networks. LSTM is a particular RNN architecture used for time-series analysis and sequence generation.
- We use LSTM's sequence-generating capability to our benefit: we train LSTMs to learn
the normal action sequences and to predict the action sequence of the next state based on history.
- After training, the deviation between the true action sequence and the LSTM-generated action
sequence indicates an anomaly.
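The idea of learning normal sequences, predicting the next action, and scoring the deviation can be illustrated without a neural network. The sketch below deliberately swaps the LSTM for a simple most-frequent-successor table, so it is a stand-in for the technique and not the project's model. All sequences are hypothetical.

```python
from collections import Counter, defaultdict

def fit_next_action_model(normal_sequences):
    """For each action, remember the action that most often follows it in normal data."""
    followers = defaultdict(Counter)
    for seq in normal_sequences:
        for current, nxt in zip(seq, seq[1:]):
            followers[current][nxt] += 1
    return {action: counts.most_common(1)[0][0] for action, counts in followers.items()}

def transition_deviation(model, sequence):
    """Fraction of transitions where the observed next action differs from the prediction."""
    transitions = list(zip(sequence, sequence[1:]))
    if not transitions:
        return 0.0
    # An action never seen in training has no prediction and counts as a miss.
    misses = sum(model.get(current) != nxt for current, nxt in transitions)
    return misses / len(transitions)

# Hypothetical normal history for one user.
normal = [
    ["log on", "web", "email", "log out"],
    ["log on", "web", "email", "log out"],
    ["log on", "email", "log out"],
]
model = fit_next_action_model(normal)

typical = ["log on", "web", "email", "log out"]
suspicious = ["log on", "drive connect", "drive disconnect", "log out"]
typical_dev = transition_deviation(model, typical)
suspicious_dev = transition_deviation(model, suspicious)
print(typical_dev, suspicious_dev)
```

An LSTM plays the same role as `model` here, but it conditions on the whole history instead of only the previous action.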
3. convLSTM Algorithm
❑ convLSTM
(ii) For different activities, we select different features to learn a user’s daily
behavior characteristics. There are potential connections between the features of
different activities, and a CNN can be used to extract these connections.
1. Initialize the LSTM model:
   (a) Create an LSTM model object.
   (b) Define the number of LSTM layers and the number of units per layer.
   (c) Set the appropriate activation functions for the LSTM layers (e.g., sigmoid, tanh).
2. Compile the model:
   (a) Define the loss function (e.g., binary cross-entropy) and the optimizer (e.g., Adam).
   (b) Set any additional metrics for evaluation (e.g., accuracy).
3. Preprocess the data:
   (a) Perform any necessary data preprocessing steps, such as scaling or normalization.
   (b) Split the data into training and testing sets.
4. Train the LSTM model:
   (a) Iterate over the training data in batches:
       i. Forward propagate the data through the LSTM layers.
       ii. Compute the loss between the predicted outputs and the actual labels.
       iii. Backpropagate the error and update the model’s weights.
       iv. Repeat for a specified number of epochs.
5. Evaluate the LSTM model:
   (a) Using the testing data:
       i. Forward propagate the data through the trained LSTM model.
       ii. Calculate the predicted outputs.
       iii. Compare the predictions to the actual labels to assess model performance (e.g., accuracy, precision, recall).
6. Make predictions:
   (a) Using new or unseen data:
       i. Preprocess the data as done during training.
       ii. Forward propagate the preprocessed data through the trained LSTM model.
       iii. Obtain the predicted outputs, which represent the likelihood of anomalous behavior.
7. Interpret and act on the results:
   (a) Analyze the predicted outputs and set a threshold to classify behavior as normal or anomalous.
   (b) Raise alerts or take appropriate actions for behavior classified as anomalous.
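The evaluation and interpretation steps (comparing predictions with labels, then thresholding the predicted likelihood) can be sketched as follows. The scores, labels, and the threshold of 0.5 are hypothetical values chosen for illustration. In practice the scores would come from the trained model.

```python
def classify(scores, threshold):
    """Label a score as anomalous (1) when it is at or above the threshold, else normal (0)."""
    return [1 if score >= threshold else 0 for score in scores]

def precision_recall(labels, predictions):
    """Precision and recall for the anomalous class (label 1)."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical model outputs (likelihood of anomalous behavior) and true labels.
scores = [0.05, 0.20, 0.90, 0.65, 0.10, 0.40]
labels = [0, 0, 1, 1, 0, 1]

predictions = classify(scores, threshold=0.5)
precision, recall = precision_recall(labels, predictions)

# Raising an alert is reduced to a print statement in this sketch.
for index, flagged in enumerate(predictions):
    if flagged:
        print(f"ALERT: sample {index} scored {scores[index]:.2f}")
print(precision, recall)
```

Lowering the threshold trades precision for recall, so the value is usually tuned on validation data.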
4. MLP
❑ We use an MLP to combine the results of the multiple deep learning
models above and perform anomaly detection.
❑ An MLP is a neural network that consists of an input layer, hidden layers,
and an output layer. The mapping between its inputs and output is non-linear,
so it is often used to solve nonlinear problems.
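A minimal forward pass shows how an MLP can turn several deviation values into one anomaly score. This is a sketch under stated assumptions: the layer sizes, the ReLU and sigmoid activations, and the hand-picked weights are hypothetical, and a real MLP would learn its weights from labeled data.

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mlp_score(deviations, hidden_weights, hidden_biases, output_weights, output_bias):
    """One hidden layer with ReLU, then a sigmoid output in the range (0, 1)."""
    hidden = [
        relu(sum(w * d for w, d in zip(row, deviations)) + bias)
        for row, bias in zip(hidden_weights, hidden_biases)
    ]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)

# Hypothetical weights for 3 inputs (feature, sequence, role deviation) and 2 hidden units.
hidden_weights = [[1.0, 1.0, 1.0], [2.0, 0.0, -1.0]]
hidden_biases = [-1.0, 0.0]
output_weights = [2.0, 1.0]
output_bias = -3.0

low = mlp_score([0.1, 0.0, 0.2], hidden_weights, hidden_biases, output_weights, output_bias)
high = mlp_score([0.9, 1.0, 0.8], hidden_weights, hidden_biases, output_weights, output_bias)
print(low, high)
```

The ReLU in the hidden layer is what makes the mapping from deviations to score non-linear.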
Architecture
Architecture Diagram
Use Case
Class Diagram
Dataset
User and entity behavior analytics is performed using the Computer Emergency Response
Team (CERT) dataset. The dataset is generated by CMU and is one of the most popular datasets for
User and Entity Behavior Analytics. The CERT dataset is well suited to training and testing the proposed
system at Big Data scale. The dataset consists of several logs (logon.csv, email.csv,
device.csv, http.csv, file.csv, and psychometric.csv) of over 1000 employees. The dataset contains 502
days, 1000 users, and 32,770,227 log entries. Some of the log entries are manually injected by domain experts.
Along with the logs, it contains metadata such as role, project, functional unit, department, and team. These
logs help to analyze the features and roles of the employees. The dataset is split into training
and validation sets.
Action Features:
• Num. device
• Files exe copy
• Files jpg copy
• Files txt/doc/pdf copy
• Files zip copy
• Num. emails sending
• Internal email sends
• Num. internal email receive
• Num. external email receive
• Size of emails
• Num. attachments
• Num. websites
• Num. career sites
• Num. news sites
• Num. tech sites, etc.
Action Sequences:
• log on
• log off
• HTTP
• device connect
• device disconnect
• Email, etc.
Implementation Details
• The deviation between real features and predicted results indicates the degree of anomaly and
can be visualized through graphs and diagrams. The details of the LSTM and convLSTM models are
given in the following tables. The LSTM model has 2 LSTM layers and the convLSTM model has 3 ConvLSTM layers.
LSTM model
Layers      Parameters
Input       Dim = 128
Reshape     Dim = (4, 32)
LSTM        Units = 100, Activation Function = tanh
LSTM        Units = 160, Activation Function = tanh
Dense       Dim = 32, Activation Function = relu

convLSTM model
Layers      Parameters
Input       Dim = 92
Reshape     Dim = (4, 6, 8, 1)
ConvLSTM    Filters = 24, kernel size = (2, 3), Activation Function = relu
ConvLSTM    Filters = 128, kernel size = (2, 3), Activation Function = tanh
ConvLSTM    Filters = 128, kernel size = (2, 3), Activation Function = tanh
maxpooling  Pool size = (3, 3)
Dense       Dim = 48
The deviation between the true and predicted features, which indicates an anomaly, is measured with the
WDD loss. WDD stands for Weighted Deviation Degree. It measures the squared difference between the true
and predicted values, scaled linearly by a weight.
In the WDD formula, V is the set of all features, y is the true value, and ŷ is the predicted value. W is a
specially designed weight value. Along with the WDD loss, the Mean Squared Error (MSE) loss is
calculated for some models. The MSE is the average of the squared differences between true and predicted values.
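The WDD formula itself is not reproduced here, so the sketch below assumes one form that is consistent with the definitions given: WDD = sum over all features i in V of w_i * (y_i - ŷ_i)^2. The exact form used in the project may differ, for example in normalization. The MSE follows its standard definition. All sample values and weights are hypothetical.

```python
def wdd_loss(y_true, y_pred, weights):
    """Weighted sum of squared differences over all features (assumed form of WDD)."""
    if not (len(y_true) == len(y_pred) == len(weights)):
        raise ValueError("all inputs must have the same length")
    return sum(w * (t - p) ** 2 for t, p, w in zip(y_true, y_pred, weights))

def mse_loss(y_true, y_pred):
    """Average of the squared differences between true and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical values for three features.
y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 4.0, 2.0]
weights = [0.5, 0.25, 1.0]  # hypothetical per-feature weights

wdd = wdd_loss(y_true, y_pred, weights)
mse = mse_loss(y_true, y_pred)
print(wdd, mse)
```

With all weights set to 1/|V|, this assumed WDD reduces to the MSE, which shows that the weights only change how much each feature contributes.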
The proposed Multimodel-Based System (MBS) aims to identify malicious
user behaviors from three perspectives. Normal and anomalous points
are distinguishable under this approach, although some false positives and false negatives exist. The
model uses a Multilayer Perceptron (MLP) to learn the relationships between the deviations and
determine abnormal behaviors. Experimental results show that MBS
outperforms the baseline model in all metrics, with an AUC value of 0.96.
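For readers unfamiliar with the AUC metric reported above, the sketch below computes it directly from its pairwise definition: the probability that a randomly chosen anomalous sample receives a higher score than a randomly chosen normal one. The labels and scores are hypothetical and unrelated to the reported result of 0.96.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison; ties count as half a win."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one anomalous and one normal sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical labels (1 = anomalous) and anomaly scores.
labels = [0, 0, 1, 1, 0, 1]
scores = [0.05, 0.50, 0.90, 0.65, 0.10, 0.40]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

An AUC of 1.0 means every anomalous sample outranks every normal one, and 0.5 corresponds to random ranking.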
Screenshots
Application
1) Insider threat:
A rogue insider remains a source of information loss. By combining ML behaviour models with data risk monitoring and the
identification of high-risk profiles, UEBA can detect anomalies in the data that people could not otherwise recognise or detect.
Conclusion
In this project, we have implemented a Multimodel-Based System for UEBA
with the convLSTM algorithm. The multimodel system shows improved accuracy over a single-model
approach and over traditional machine learning algorithms. With the help of deep learning models, we have built a
system with which the admin can detect insider attacks within the organization.
To enhance the proposed system, we have created a dashboard where the admin can
track malicious users. The admin dashboard is a web application where the admin can keep
watch on malicious users and receive alerts in case of an insider attack. In the future, this model
can be used to build software that runs continuously in real time.
Future Work
Future work for User and Entity Behavior Analytics (UEBA) involves exploring new avenues to enhance the effectiveness
and scope of behavior modeling and analysis. Potential areas for future development include:
2. Contextual analysis:
Incorporating contextual information into behavior modeling can provide deeper insights and a better understanding
of the intent behind user and entity behaviors. Contextual factors such as user roles, time of day, location, and access patterns
can be considered to improve the accuracy of behavior models.
[1] M. Shashanka, M. -Y. Shen and J. Wang, "User and entity behavior analytics for enterprise security," 2016 IEEE International Conference on Big Data
(Big Data), 2016, pp. 1867-1874, doi: 10.1109/BigData.2016.7840805.
[2] B. Böse, B. Avasarala, S. Tirthapura, Y. -Y. Chung and D. Steiner, "Detecting Insider Threats Using RADISH: A System for Real-Time Anomaly Detection
in Heterogeneous Data Streams," in IEEE Systems Journal, vol. 11, no. 2, pp. 471-482, June 2017, doi: 10.1109/JSYST.2016.2558507.
[3] S. Khaliq, Z. U. Abideen Tariq and A. Masood, "Role of User and Entity Behavior Analytics in Detecting Insider Attacks," 2020 International
Conference on Cyber Warfare and Security (ICCWS), 2020, pp. 1-6, doi: 10.1109/ICCWS48432.2020.9292394.
[4] M. A. Salitin and A. H. Zolait, "The role of User Entity Behavior Analytics to detect network attacks in real time," 2018 International Conference on
Innovation and Intelligence for Informatics, Computing, and Technologies (3ICT), 2018, pp. 1-5, doi: 10.1109/3ICT.2018.8855782.
[5] Zhihong Tian, Chaochao Luo, Hui Lu, Shen Su, Yanbin Sun, and Man Zhang. 2020. User and Entity Behavior Analysis under Urban Big Data. ACM/IMS
Trans. Data Sci. 1, 3, Article 16 (August 2020), 19 pages. https://doi.org/10.1145/3374749
[6] O. Carlsson and D. Nabhani, ‘User and Entity Behavior Anomaly Detection using Network Traffic’, Dissertation, 2017.
[7] D. Denning. An intrusion detection model. IEEE Trans. on Software Engg., 13(2), 1987.
[8] A. Pinto. Secure because math: A deep-dive on machine learning based monitoring. In Black Hat Briefings USA, 2014.
[9] R. Sommer and V. Paxson. Outside the closed world: On using machine learning for network intrusion detection. In Security and Privacy (SP), 2010
IEEE Symposium on, pages 305–316. IEEE, 2