The Insider Threat Detection Method of University Website Clusters Based On Machine Learning

The document discusses a machine learning-based method for detecting insider threats to university website cluster systems. It proposes a model that can automatically parse and detect anomalies in log data without labeled data. The model learns normal behavior patterns for different user types based on IP and role, and detects abnormalities compared to the learned patterns. An evaluation on a university cluster system showed the insider anomaly detection model performed well.

2023 6th International Conference on Artificial Intelligence and Big Data

The Insider Threat Detection Method of University Website Clusters Based on Machine Learning
2023 6th International Conference on Artificial Intelligence and Big Data (ICAIBD) | 978-1-6654-9125-9/23/$31.00 ©2023 IEEE | DOI: 10.1109/ICAIBD57115.2023.10206282

Yangyang Li
School of Science
Civil Aviation Flight University of China
Chengdu, P. R. China
[email protected]

Yaping Su
School of Computer Science
Civil Aviation Flight University of China
Chengdu, P. R. China
[email protected]

Abstract—In recent years, the informatization of universities has developed rapidly, and the cluster system has become a standard configuration for most university websites. However, the use of cluster systems has introduced many risks and problems that hinder the progress of information security in universities. In particular, security threats caused by the malicious behavior of internal personnel are difficult to detect because of their fuzzy boundaries and the limited sample data. This paper proposes a machine learning-based log anomaly detection model that automatically parses log data and detects insider threats in a university's cluster system without the need for annotated data. Considering the differences between users, the model learns the behavior patterns of each user type based on IP and role, and then detects abnormal behaviors against the learned normal behavior patterns. An experimental evaluation on user data from the cluster system of the Civil Aviation Flight University of China indicates that the proposed insider anomaly detection model performs well.

Keywords—university website cluster system, insider threat detection, log anomaly detection, machine learning, unsupervised learning

I. INTRODUCTION

With the rapid development of university websites, the website cluster system has become a standard feature of most university websites and an important platform for university publicity, communication, and information dissemination. However, as website cluster systems have become widely used in universities, security issues have gradually surfaced. Network system security threats come mainly from two sources: external threats and insider threats. As universities have responded to the government's call to gradually establish relatively sound network security mechanisms, their defensive capability against outside attacks has improved considerably. As a result, insider threats now deserve sufficient attention. Insider threats generally refer to intentional or unintentional violations of rules by personnel within the organization that lead to security issues such as system paralysis, data loss, and data leakage. These issues pose a serious threat to the stable operation and normal development of university website cluster systems. Therefore, how to effectively monitor insider threats to university website cluster systems has become an urgent problem to be solved [1], [2].

Currently, there is some related research on detecting insider threats to network systems. Rule-based and statistics-based methods are the most common. For example, Bao proposes an insider threat detection method for smart grid data monitoring devices based on behavior rules [3]. The method extracts three behavior rules that describe the behavior norms of each device and detects whether the behavior of a monitoring device deviates from those norms. Davison takes the user command sequence as the object, calculates the probability of occurrence of adjacent command patterns, and judges whether a new command is an exception by matching its similarity to historical commands [4]. However, these methods have limitations: they are easily circumvented and deceived by attackers, and they cannot handle unknown attacks, vulnerabilities, or changes in attackers. In recent years, with the continuous development of machine learning, user and entity behavior analysis based on machine learning has been widely used in insider threat detection. This approach is mainly data-driven: through big data analysis, it extracts and classifies features of user behavior, judges whether a user's behavior is normal, and identifies possible abnormal behaviors to determine whether they are insider threats. The difficulty of this approach lies in processing audit log data. If different data are simply concatenated, problems arise such as partial feature failure, high model training complexity, and model overfitting.

Based on the above, this paper proposes a machine learning-based log anomaly detection method for insider threats to university website cluster systems and verifies it using user operation data from the website cluster system of the Civil Aviation Flight University of China. Specifically, the research content of this paper includes data preprocessing, feature extraction, algorithm training, and model evaluation. Experimental verification shows that the proposed method achieves good results in accuracy and recall. It provides a new solution for insider threat detection in university website cluster systems, which will protect the data and operation of

university website cluster systems and provide support for the digital transformation of universities.

II. RELATED WORKS

A. Overview of University Website Cluster System

A website cluster system is a software system that centrally manages multiple websites. It typically consists of a main site and multiple sub-sites. In a university website cluster system, the main site is usually the school's official homepage, while the sub-sites include the homepages of the various colleges. Users can query and publish various types of information through the website cluster system. The website cluster system is an efficient, centralized, and scalable website management platform with the following advantages:

• Convenient management: The website cluster system can centrally manage multiple websites, including website content, publishing, and updates. This avoids the cumbersome task of managing each website individually and improves website management efficiency.

• Cost savings: Using a website cluster system can reduce the cost of building and maintaining multiple websites, because the websites can share resources such as servers, themes, and plugins. Additionally, the website cluster system can automate some website management tasks, reducing labor costs.

• Unified style: The website cluster system can unify the style and interface design of the sites, giving them a consistent visual effect and user experience. This not only improves users' trust in and satisfaction with the sites but also facilitates information search and communication across different sites.

The security protection of the university website cluster system mainly covers three aspects: first, data backup measures that independently and remotely back up the data and templates of the website cluster system; second, an application firewall that protects the system from unauthorized access and attacks; finally, a built-in anti-tampering module that protects page security for all sites in the website cluster.

B. Concept, Detection Methods, and Research Overview of Insider Threats

Insider threats refer to attacks initiated by individuals or system accounts within an organization, or to unintentional but harmful behavior, which may result in security issues such as confidential data leakage, confidential data tampering, and system service interruption. Insider threats are usually more difficult to prevent and detect than outsider threats, because internal personnel have system permissions and can bypass firewalls or other security measures, making it easier for them to access system data and resources [5], [6].

The system log anomaly detection model for insider threats typically includes the following three steps.

The first step is log parsing, which involves structured analysis of raw logs using a log parser to extract log templates that describe system or process behavior. There are two common methods for log parsing: clustering-based and heuristic-based. In clustering-based log parsers, the distances between logs are computed first, then clustering techniques are used to group logs into different clusters, and finally event templates are generated from each cluster. Examples include SLCT, proposed by Vaarandi et al. [7], and the IPLoM algorithm, proposed by Makanju et al. [8]. Heuristic log parsers, on the other hand, merge frequently occurring words into event templates by calculating the occurrence frequency of each word at each log position. Examples include the FT-Tree model proposed by Zhang et al. in 2017 [9] and the Template2Vec model proposed by Meng et al. in 2019, which is based on synonym and antonym sets of template words [10].

The second step is feature extraction. The behavior sequence of a system or process is constructed from timestamps and log templates and encoded into a feature vector for use in subsequent algorithms. To extract behavior sequence features, log data is usually divided into groups using windows, with each group representing a log sequence. Two types of timestamp windows are common: fixed windows and sliding windows. The size of a fixed window, ΔT, as seen in Figure 1, is constant and is usually one hour or one day. Consequently, the number of fixed windows is determined by the window size. Logs that occur within the same window are considered part of the same log sequence. A sliding window, on the other hand, is characterized by two fixed values: the window size ΔT, which is the time span of each window, and the step size Δt, which is the time span of each slide. The window moves along the timeline by the step size, as Figure 2 illustrates [11]. Logs that fall within the same sliding window are combined into a single log sequence. Sliding produces overlap between consecutive windows, which increases the number of log sequences identified and reduces the errors caused by uneven window coverage.

Fig. 1. Fixed windows.

Fig. 2. Sliding windows.

The final step is anomaly detection, which is performed by algorithms operating on the processed behavior sequence feature vectors. Machine learning-based log anomaly detection methods can be divided into supervised and unsupervised approaches. Supervised methods include Logistic Regression, Decision Trees, and Support Vector Machines (SVM). These methods

need labeled training data, and their accuracy depends on the quantity and accuracy of the labels. SVM-based log anomaly detection methods have solid theoretical support for small-sample learning, but they are difficult to apply to multi-class problems on large-scale datasets (Liang et al. [12] and Le et al. [13]). Logistic regression-based methods have been widely used to solve binary classification problems (Breier et al. [14] and He et al. [15]), but in practical applications they presuppose linear relationships in the data and cannot fit nonlinear relationships well. Decision tree-based methods (Chen et al. [16]) describe general decision processes by constructing top-down tree structures and use multi-level decisions to determine the true class of the target data, but they ignore the correlation between data features and perform poorly when the sample data is unbalanced. Efficient log anomaly detection methods based on automatic labeling of samples using KNN (Ying et al. [17]) fit existing sample points into clusters for category division. They determine whether the current sample point is anomalous by calculating the distance from the point to each cluster center and comparing it with the best threshold obtained by fitting with KNN. However, this method requires a large amount of normal log data for learning, is prone to prediction errors, and consumes more computation time and memory than other methods.

Unsupervised methods include clustering, association rule mining, and Principal Component Analysis (PCA), among others, and do not require labeled training data. The LogCluster method [8] identifies online anomalies by clustering logs. CloudSeer [18] determines anomalies by detecting errors in cross-log sequences. Other work includes Xu et al. [19], who use PCA to detect large-scale system problems by mining console logs; Lou et al. [20], who detect system problems by mining invariants from console logs; and Vaarandi et al. [21], who propose an event log data clustering and pattern mining algorithm.

III. THE APPROACH

Machine learning-based log anomaly detection methods are mainly divided into two stages, a training stage and a production stage, as shown in Figure 3.

Fig. 3. The overall structure of the detection method.

The training stage includes four steps: log parsing, feature extraction, algorithm learning, and storage of the representative behavior feature sequences. In the production stage, real-time logs from the actual production environment are first parsed and their features extracted. The extracted features are then compared with the representative behavior feature sequences in the knowledge base to detect whether they form an anomalous behavior sequence. Finally, the results are fed back to the administrator, who can then make judgments on them.

A. Log Parsing

The operational logs of the web cluster system are unstructured log text collected by the log collection system. They contain information such as time, IP, operating account, and the specific operation performed. The main purpose of log parsing is to convert the specific operation records in the log into a structured form, making it easier for the algorithm model to learn the behavior patterns in the log. In this model, following the Frequent Template Tree (FT-Tree) model [9], the longest combination of frequently occurring words in the log is extracted to obtain the log template, which serves as the preprocessing of the original log dataset. Figure 4 shows an example of log parsing, where the top part is the original log information and the bottom part is the structured log text corresponding to the log events.

Fig. 4. Log parsing.

B. Feature Extraction

Feature extraction mainly involves partitioning and vectorizing the parsed log features for further operations. The process of log feature extraction is shown in Figure 5.

Log sequence partitioning: Since it is difficult to tell from an individual operation whether a user has violated the rules, sequence data provides more valuable information by placing the user's behavior in its context. Therefore, considering the differences in behavior patterns among users, the model first slices the parsed log features based on IP and date, and then divides the sliced logs into

multiple log sequences using a sliding window method. Furthermore, since users have different roles and responsibilities, their operation behaviors may also differ, whereas users with the same role should behave similarly. Therefore, the model also identifies user roles in the log sequences and automatically distinguishes users by identity, so that behavior pattern learning and anomaly detection are conducted separately for each user.

Fig. 5. Feature extraction process.

Log vectorization: The log sequences are converted into log event count vectors, which are then weighted based on the Inverse Document Frequency (IDF) method [22]. The IDF equation is:

$$ w_{\mathrm{idf}}(t) = \log \frac{N}{n_t} $$

where N is the number of log sequences and n_t is the number of sequences in which the event t appears. This means that log events occurring frequently across many log sequences receive a lower weight, because their discriminative power should be lower than that of events appearing in only a few log sequences.

C. Algorithm Learning

In this part, we use a clustering algorithm to learn the log sequences of normal user behavior and obtain representative log sequences. The clustering algorithm proposed by Vaarandi [7], based on text similarity and agglomerative hierarchical clustering, is suitable for small unlabeled datasets because it is unsupervised. To determine which log sequences are similar, we compute the cosine similarity between vectors and use it to determine the distance between each pair of log sequences. The similarity between vectors S_i and S_j is:

$$ \mathrm{Similarity}(S_i, S_j) = \frac{S_i \cdot S_j}{\lVert S_i \rVert \, \lVert S_j \rVert} = \frac{\sum_{k} (S_iE_k)(S_jE_k)}{\sqrt{\sum_{k} (S_iE_k)^2} \, \sqrt{\sum_{k} (S_jE_k)^2}} $$

where S_iE_k denotes the kth event in the ith sequence vector.

During the agglomerative hierarchical clustering [23] process, each log sequence first forms its own cluster. The next step selects and merges the closest pair of clusters. We employ the maximum distance over all element pairs between two clusters as the cluster distance metric to determine which pair of clusters to merge. Then, we set a distance threshold θ. When the maximum distance between a pair of clusters exceeds the distance threshold, the clustering process for that pair is terminated.

A central sequence is selected to represent each cluster and is stored in a database. This is accomplished by computing the score of each log sequence i in the cluster, based on its mean distance to the other log sequences in the same cluster:

$$ \mathrm{Score}(i) = \frac{1}{n-1} \sum_{j=1,\, j \neq i}^{n} \left( 1 - \mathrm{Similarity}(S_i, S_j) \right) $$

where n is the number of log sequences in the cluster. We choose the log sequence with the lowest score as the representative sequence for each cluster.

During the detection phase, the model calculates the similarity between the detected log sequence and the representative event sequences in the knowledge base. If the maximum distance between the detected log sequence and all representative event sequences is larger than a threshold α, the detected log sequence is considered an anomalous behavior sequence. The advantage of this method is that it does not require labeled data. It can also learn log behavior patterns automatically and capture changes in them.

IV. EXPERIMENTAL ANALYSIS

A. Introduction of Experimental Dataset

In this study, we selected the log dataset of the Civil Aviation Flight University of China web cluster system to evaluate the model. The dataset contains more than 5,000 operational logs of system users from the past two years. User types include system administrators, information officers, and ordinary users. To facilitate the experiment, three types of abnormal logs were injected into the dataset, and the abnormal types were manually labeled by security experts.

B. Experimental Results Analysis

To verify the effectiveness of the model proposed in this paper, a total of three experiments were conducted. The first experiment investigates the impact of different parameters on the F1 score of the model, varying one parameter at a time while holding the other fixed. The parameters of the proposed model are the maximum clustering distance θ and the abnormal threshold α. The experimental results are shown in Figure 6.

Fig. 6. F1 values heatmap.
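The log sequence partitioning and vectorization steps of Section III.B can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `sliding_windows` and `idf_weighted_vectors`, the event names, and the window and step values are hypothetical, and the events are assumed to belong to a single IP-and-date slice that is already sorted by time. The weight uses the log(N / n_t) form of IDF.

```python
import math
from collections import Counter
from datetime import datetime, timedelta


def sliding_windows(events, window, step):
    """Group (timestamp, template_id) pairs into overlapping log sequences.

    `events` must be sorted by timestamp. Windows start at the first event,
    advance by `step`, and each cover [start, start + window).
    Windows that contain no event are skipped.
    """
    if not events:
        return []
    sequences = []
    start, last = events[0][0], events[-1][0]
    while start <= last:
        end = start + window
        seq = [tid for ts, tid in events if start <= ts < end]
        if seq:
            sequences.append(seq)
        start += step
    return sequences


def idf_weighted_vectors(sequences, vocabulary):
    """Turn log sequences into event count vectors weighted by log(N / n_t)."""
    n_sequences = len(sequences)
    # n_t: number of sequences in which event t appears at least once.
    doc_freq = Counter(t for seq in sequences for t in set(seq))
    idf = {
        t: math.log(n_sequences / doc_freq[t]) if doc_freq[t] else 0.0
        for t in vocabulary
    }
    vectors = []
    for seq in sequences:
        counts = Counter(seq)
        vectors.append([counts[t] * idf[t] for t in vocabulary])
    return vectors


if __name__ == "__main__":
    # Hypothetical parsed events for one user slice (illustrative only).
    t0 = datetime(2023, 1, 1, 9, 0)
    events = [
        (t0, "login"),
        (t0 + timedelta(minutes=20), "edit_page"),
        (t0 + timedelta(minutes=50), "publish"),
        (t0 + timedelta(minutes=70), "login"),
        (t0 + timedelta(minutes=100), "delete_site"),
    ]
    vocab = ["login", "edit_page", "publish", "delete_site"]
    seqs = sliding_windows(events, timedelta(hours=1), timedelta(minutes=30))
    for seq, vec in zip(seqs, idf_weighted_vectors(seqs, vocab)):
        print(seq, [round(x, 3) for x in vec])
```

Because the step is shorter than the window, consecutive windows share events, which is the overlap described in Section II.B. An event that appears in every sequence receives weight log(1) = 0 and therefore does not contribute to the comparison between sequences.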

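The algorithm learning and detection steps of Section III.C can be sketched in the same way. The code below is an illustrative sketch rather than the authors' implementation, and it rests on several assumptions: distance is taken as one minus cosine similarity, the clustering loop is a naive complete-linkage procedure that stops once the closest pair of clusters is farther apart than θ, and a sequence is flagged when its distance to the nearest representative exceeds α, which is one reading of the detection rule. All function names and the toy vectors are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity; an all-zero vector is treated as maximally distant."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def complete_linkage_clusters(vectors, theta):
    """Agglomerative clustering with complete linkage and a distance threshold.

    The distance between two clusters is the maximum pairwise distance between
    their members. Merging stops when the closest pair is farther than theta.
    Returns clusters as lists of indices into `vectors`.
    """
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = max(
                    cosine_distance(vectors[i], vectors[j])
                    for i in clusters[p]
                    for j in clusters[q]
                )
                if best is None or d < best[0]:
                    best = (d, p, q)
        d, p, q = best
        if d > theta:
            break
        merged = clusters.pop(q)  # q > p, so index p is unaffected
        clusters[p].extend(merged)
    return clusters


def representative(cluster, vectors):
    """Pick the member with the lowest mean distance to the other members."""
    if len(cluster) == 1:
        return cluster[0]

    def score(i):
        others = [j for j in cluster if j != i]
        return sum(cosine_distance(vectors[i], vectors[j]) for j in others) / len(others)

    return min(cluster, key=score)


def is_anomalous(vector, representatives, alpha):
    """Flag a sequence whose nearest representative is farther away than alpha."""
    return min(cosine_distance(vector, r) for r in representatives) > alpha


if __name__ == "__main__":
    # Hypothetical weighted count vectors of normal log sequences.
    normal = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.9, 0.1]]
    clusters = complete_linkage_clusters(normal, theta=0.5)
    reps = [normal[representative(c, normal)] for c in clusters]
    print(clusters)
    print(is_anomalous([1.0, 0.05, 0.0], reps, alpha=0.4))
    print(is_anomalous([0.0, 0.0, 1.0], reps, alpha=0.4))
```

The values θ = 0.5 and α = 0.4 are taken from the parameter study in Section IV.B. The pairwise search is cubic in the number of sequences, which is acceptable for a dataset of a few thousand logs but would need a more efficient linkage routine at larger scale.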
According to Fig. 6, when the maximum clustering distance θ is fixed, the overall performance of the model first rises and then falls as the abnormal threshold α varies. When α is 0.4, the model performs relatively well. Model performance is not sensitive to changes in the maximum clustering distance θ. When α is small, the model performs better with smaller values of θ, but when α is large, larger values of θ give better performance. Overall, the model performs best when θ and α are both in the middle of their ranges. When θ = 0.5 and α = 0.4, the model performs best, with an F1 score of 0.62, a precision of 0.5, and a recall of 0.83. The main reason for the overall low accuracy of the model is the small sample size: the events involved in misclassified event sequences are events that did not appear in the training set.

The purpose of the second experiment is to verify the effectiveness of the proposed model. Two popular unsupervised detection methods are selected for comparison: Invariant Mining (IM) [20] and Isolation Forest (IF) [24]. The experimental results are shown in Figure 7. Although the recall of the proposed algorithm is slightly lower than that of the Invariant Mining algorithm, its precision and F1 score are significantly higher than those of the other two algorithms.

Fig. 7. Comparison of experimental results of LogCluster, Invariant Mining (IM) and Isolation Forest (IF) in accuracy rate, recall rate and F1.

The third experiment compares detection with and without distinguishing user identity, in order to verify whether the model can identify different user behavior patterns more effectively. When user identity is distinguished, the precision of the model is 0.5, the recall is 0.83, and the F1 value is 0.62. Without user identity, the precision is 0.25, the recall is 0.64, and the F1 value is 0.35. The results show that the model in this paper, which uses user identity, achieves better results than the plain LogCluster model.

V. CONCLUSION

In this paper, a machine learning-based method for detecting insider threats in university web cluster systems has been proposed. The model distinguishes the normal behavior patterns of different users by using timestamps and user identity, and detects anomalies by computing the similarity between the detected log sequences and the learned normal behavior patterns. Experimental results demonstrate that our model outperforms existing unsupervised log-based anomaly detection models. As a next step, we will study the false positive results to improve the detection model's performance. We also plan to integrate the model into the web cluster system and evaluate its performance in practical applications, and we will make further improvements to the model based on more application results.

REFERENCES
[1] Huang Jiachang. Discussion on Security risk and Countermeasures of university station Group System [J]. Information and Computer (Theory), 2019, 31(21): 212-214. (in Chinese)
[2] Zhu Xiaofu. Discussion on Key technologies of Security Level Protection of university website Group [J]. Digital Technology and Application, 2019, 37(03): 207-236. (in Chinese)
[3] BAO Haiyong, LU Rongxing, LI Beibei, et al. BLITHE: Behavior Rule-based Insider Threat Detection for Smart Grid [J]. IEEE Internet of Things Journal, 2015, 3(2): 190-205.
[4] DAVISON B D, HIRSH H. Predicting sequences of user actions [C]// Proceedings of the AAAI/ICML 1998 Workshop on Predicting the Future AI Approaches to Timeseries Analysis. [S.l.]: AAAI, 1998.
[5] Guo Shize, Zhang Lei, Pan Yu, Tao Wei, Bai Wei, Zheng Qibin, Liu Yi, Pan Zhisong. A review of internal threat detection methods [J]. Data Acquisition and Processing, 202, 37(03): 488-501. (in Chinese)
[6] Zhang You, Wang Kaiyun, Zhang Chunrui, Deng Miaoran. Review of Internal Threat Detection based on User Behavior log [J]. Computer Age, 2020(09): 45-49. (in Chinese)
[7] Vaarandi R. A data clustering algorithm for mining patterns from event logs [C]// Proceedings of the 3rd IEEE Workshop on IP Operations & Management, 2003: 119-126.
[8] Makanju A, Zincir-Heywood A N, Milios E E. A lightweight algorithm for message type extraction in system application logs [J]. IEEE Transactions on Knowledge and Data Engineering, 2011, 24(11): 1921-1936.
[9] Zhang S, Meng W, Bu J, et al. Syslog processing for switch failure diagnosis and prediction in datacenter networks [C]// IEEE/ACM 25th International Symposium on Quality of Service (IWQoS), IEEE, 2017: 1-10.
[10] Meng W, Liu Y, Zhu Y, et al. LogAnomaly: unsupervised detection of sequential and quantitative anomalies in unstructured logs [C]// International Joint Conference on Artificial Intelligence (IJCAI), 2019: 4739-4745.
[11] Feng Shilong, Tai Xianqing, Ma Zhijie. An improved anomaly detection method based on log clustering [J]. Computer Engineering and Design, 2020, 41(04): 1087-1092. (in Chinese)
[12] LIANG Y, ZHANG Y, XIONG H, et al. Failure prediction in ibm bluegene/l event logs [C]// Seventh IEEE International Conference on Data Mining (ICDM 2007). IEEE, 2007: 583-588.
[13] Le D C, Zincir-Heywood A N. Evaluating insider threat detection workflow using supervised and unsupervised learning [C]// 2018 IEEE Security and Privacy Workshops (SPW). IEEE, 2018: 270-275.
[14] BREIER J, BRANIŠOVÁ J. Anomaly detection from log files using data mining techniques [M]// Information Science and Applications. Springer, Berlin, Heidelberg, 2015: 449-457.
[15] HE P, ZHU J, HE S, et al. Towards automated log parsing for large-scale log data analysis [J]. IEEE Transactions on Dependable and Secure Computing, 2017, 15(6): 931-944.
[16] CHEN M, ZHENG A X, LLOYD J, et al. Failure diagnosis using decision trees [C]// International Conference on Autonomic Computing, 2004. Proceedings. IEEE, 2004: 36-43.
[17] YING S, WANG B, WANG L, et al. An Improved KNN-Based Efficient Log Anomaly Detection Method with Automatically Labeled Samples [J]. ACM Transactions on Knowledge Discovery from Data (TKDD), 2021, 15(3): 1-22.

[18] Yu X, Joshi P, Xu J, et al. CloudSeer: workflow monitoring of cloud infrastructures via interleaved logs [J]. ACM SIGARCH Computer Architecture News, 2016, 44(2): 489-502.
[19] Xu W, Huang L, Fox A, et al. Detecting large-scale system problems by mining console logs [C]// Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, 2009: 117-132.
[20] Lou J, Fu Q, Yang S, et al. Mining invariants from console logs for system problem detection [C]// Proceedings of the 2010 USENIX Annual Technical Conference, Berkeley, CA, USENIX Association, 2010: 1-14.
[21] VAARANDI R, PIHELGAS M. Logcluster-a data clustering and pattern mining algorithm for event logs [C]// 2015 11th International Conference on Network and Service Management (CNSM). IEEE, 2015: 1-7.
[22] C. D. Manning, P. Raghavan and H. Schütze. Introduction to Information Retrieval, Cambridge University Press, 2008.
[23] J. C. Gower, G. J. S. Ross, "Minimum spanning trees and single linkage cluster analysis", Journal of the Royal Statistical Society, Series C 18 (1): 54–64, 1969.
[24] Liu F T, Ting K M, Zhou Z H. Isolation forest [C]// 2008 Eighth IEEE International Conference on Data Mining, 2008.
