Network Attack Detection and Visual Payload Labeling
Abstract
In recent years, Internet of Things (IoT) devices have been playing an important role in business, education, medicine, and other fields. The number of devices connected to the Internet is already much larger than the world's population. However, because of their accessibility, these devices can easily face all kinds of attacks from the Internet. Most attacks against IoT devices are based on Web applications, so protecting the security of Web services can effectively improve the situation of the IoT ecosystem. Conventional Web attack detection methods rely heavily on labeled samples, and the detection results of artificial intelligence models are hard to interpret. Hence, this article introduces a supervised detection algorithm based on benign samples. The Seq2Seq algorithm is chosen and applied to detect malicious Web requests. Meanwhile, the attention mechanism is introduced to label the attack payload and to highlight abnormal characters. The experimental results show that, on the premise of training with benign samples only, the precision of the proposed model is 97.02% and the recall is 97.60%. This shows that the model can detect Web attack requests effectively. At the same time, the model can label the attack payload visually and make the results ‘‘interpretable.’’
Keywords
Seq2Seq, IoT devices, Web attack detection, autoencoder, attention mechanism, attack visualization
complexity of its configuration. These factors are relevant to the fact that Web attacks are happening with increasing frequency, through which attackers retrieve or change sensitive data or even execute arbitrary code on remote systems.

Most attacks against IoT devices are based on Web applications. To deal with the various attack methods, it has become a trend for security researchers to apply machine learning and deep learning to Web application attack detection.

A trust-aware probability marking traceback scheme has been proposed to locate malicious sources quickly.5 The nodes are marked with different marking probabilities according to their trust, which is deduced by trust evaluation. The high marking probability for low-trust nodes can locate malicious sources quickly, and the low marking probability for high-trust nodes can reduce the number of markings to improve the network lifetime, so both the security and the network lifetime are improved in this scheme. Wu et al.6 proposed a safety detection mechanism based on the analysis of big data. Fuzzy cluster analysis, game theory, and reinforcement learning are integrated seamlessly to perform the safety detection. The simulation and experimental results show the advantages of this scheme in terms of high efficiency and low error rate.

Adeva and Atxa7 proposed another attack identification method. This method extracts metadata from the Web log, including date, source address, size, type, and so on. Besides, it selects the best features through feature assessment and classifies log samples or identifies attacks with the Naive Bayes algorithm,8 the k-nearest neighbors (KNN) algorithm,9 and the Rocchio algorithm.10 The method has a detection rate of more than 90%. However, it is only a post hoc test and cannot defend against attacks in time. Raghuveer and Chandrasekhar11 proposed a detection model combining support vector machine (SVM),12 fuzzy neural network,13 and K-means.14 This model clusters and generates various subsets by the K-means algorithm and then trains different neuro-fuzzy models to gain eigenvectors. Rathore et al.15 designed a cross-site scripting (XSS) classifier based on social networking service (SNS) websites; three types of eigenvalues (URL, HTML labels, and SNS) are processed by manual selection, and the classifier constructs eigenvectors from them. Then, 10 machine-learning algorithms, including RF (Random Forest),16 ADTree (Alternating Decision Tree),17 SVM, LR (Logistic Regression),18 and so on, verify and identify whether the eigenvectors are attacks. This method compares each algorithm and obtains the best algorithm model, but it has limited scalability, and human factors greatly affect the detection results, as it requires manual maintenance and obtaining large numbers of eigenvectors manually. Through Web visit statistics, Yang et al. observed that normal HTTP requests were the majority and their behavior patterns were similar, while malicious requests were the minority and their behavior patterns were changeable.19 They proposed an unsupervised algorithm based on text clustering to distinguish normal requests from malicious requests. It proved to have a high detection rate and a low false alarm rate. Zhang et al. picked the first 300 bytes of characters in Web communication traffic through statistical analysis of Webshell traffic. The sequence vector was generated based on the American Standard Code for Information Interchange (ASCII).20 Then CNN (convolutional neural network)21 and LSTM (long short-term memory)22 models were trained on the sequence vectors to classify the sample data. This method obtained rather good results, with a 98.2% detection rate and a 97.84% recall rate.

The above detection methods work well when labeled data sets are given. However, there are still some problems that need resolving:

1. The lack of labeled data: There are numerous normal request samples but few variegated attack samples in the real environment, which causes obstacles to the model's learning and training.
2. The lack of sample classes: In the training stage, if there are only SQL injection and XSS attacks in the sample data sets, it is hard to identify command execution or new payload attacks in the real environment. Besides, Web applications run by different users vary greatly; even SQL injection has numerous forms. Obviously, we cannot be sure that data collected in the past will train a model that can detect unknown attacks, and the results in the experimental environment can differ greatly from those in the real environment.
3. The interpretability of the results: If the model identifies SQL injection, security researchers can find the exact location of the attack payload so that they can maintain the Web applications accordingly. But common Web maintainers may not understand the significance of the alarm. Even though they constrain the attacks at that very moment, they still cannot repair the Web applications, so the security risk still exists.

As the services provided by IoT devices are often subject to Web attacks, an attack detection model based on Seq2Seq23 is proposed in this article to improve the security of IoT devices and address the shortcomings of current Web attack detection technologies.
This model helps acquire a great many normal samples, identify various kinds of Web attacks efficiently, and locate the attack payload in a timely manner. We summarize the major contributions as follows:

1. We propose a visual payload labeling model to detect network attacks. Under the premise of using only benign training samples, our model has good precision and recall.
2. As our model relies on comparing predicted values and thresholds to classify benign and malicious Web requests, it can identify whether a Web request is malicious rather than defend against only a specific type of Web attack.
3. Our model not only distinguishes normal requests from attack requests but also interprets the detection results by visually labeling the attack payloads. In the encoding stage, we encode HTTP request samples through the Bi-LSTM algorithm and maintain the context semantics of the request. In the decoding stage, we introduce the attention mechanism, calculate the probability distribution of each character in the sequence vector, and mark the exact location of the attack payload. The detection results of our model are therefore interpretable. Website maintainers are able to locate the attack payload swiftly, repair security risks in time, and protect the data security of enterprises or organizations.

Web attack detection model based on Seq2Seq

Detection model framework

Figure 1 presents the whole framework of the Web attack detection model based on Seq2Seq. The model is mainly divided into three modules: the data preparation module, the attack detection module, and the attack payload visualization module. In the data preparation module, preprocessing the original HTTP request samples, establishing the vocabulary, and generating sequence vectors that meet the model's input requirements happen in sequence. In the attack detection module, the main task is to construct and train the attack detection model as well as to test and classify the test sample sets. In the attack payload visualization module, the attack payload is visually labeled: normal elements (characters) are labeled white, while abnormal elements (characters) are labeled red.
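To make the data preparation module concrete, the following is a minimal sketch of character-level vocabulary construction and sequence-vector generation. It is only an illustration of our reading of the module; every name in it (build_vocab, to_sequence, the special tokens) is hypothetical and not taken from the authors' code.

    # Sketch: character-level vocabulary and fixed-length sequence vectors
    # built from raw HTTP request strings. All names are illustrative.
    PAD, GO, EOS, UNK = "<PAD>", "<GO>", "<EOS>", "<UNK>"

    def build_vocab(requests):
        """Collect every character seen in the benign training requests."""
        chars = sorted({ch for req in requests for ch in req})
        vocab = [PAD, GO, EOS, UNK] + chars
        return {ch: idx for idx, ch in enumerate(vocab)}

    def to_sequence(request, vocab, max_len=200):
        """Map one HTTP request to a fixed-length vector of character ids."""
        ids = [vocab.get(ch, vocab[UNK]) for ch in request[:max_len]]
        ids += [vocab[PAD]] * (max_len - len(ids))   # right-pad short requests
        return ids

    # Example usage on a toy benign sample.
    vocab = build_vocab(["GET /index.php?id=1 HTTP/1.1"])
    vec = to_sequence("GET /index.php?id=1 HTTP/1.1", vocab)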
h_t = f(h_{t-1}, x_t)    (2)

h'_t = f(h'_{t-1}, x_t)    (3)

f is the encoding function of the Bi-LSTM, h_t is the output of the forward LSTM hidden layer, and h'_t is the output of the reverse LSTM hidden layer. For the semantic coding C, the output information of the encoder's hidden layer is generally aggregated to obtain the semantic vector of the middle layer

C = q[H_1, H_2, H_3, ..., H_t]    (4)

A common simple method is to use the hidden layer output of the last moment as the semantic vector C, that is

C = q[H_1, H_2, H_3, ..., H_t] = H_t    (5)
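For concreteness, a minimal sketch of the Bi-LSTM encoder of equations (2) to (5) is given below, assuming TensorFlow 1.x (the version listed later in the experimental environment) and taking the last hidden output as the semantic vector C as in equation (5). The constants and variable names are illustrative only and are not the authors' implementation.

    import tensorflow as tf

    VOCAB_SIZE, EMB_DIM, HIDDEN, MAX_LEN = 100, 64, 128, 200

    # Character ids of one batch of HTTP requests, shape [batch, MAX_LEN].
    inputs = tf.placeholder(tf.int32, [None, MAX_LEN])
    embedding = tf.get_variable("embedding", [VOCAB_SIZE, EMB_DIM])
    embedded = tf.nn.embedding_lookup(embedding, inputs)   # [batch, MAX_LEN, EMB_DIM]

    # Forward and backward LSTM cells of the Bi-LSTM encoder (equations (2) and (3)).
    cell_fw = tf.nn.rnn_cell.LSTMCell(HIDDEN)
    cell_bw = tf.nn.rnn_cell.LSTMCell(HIDDEN)

    # static_bidirectional_rnn expects a length-T list of [batch, EMB_DIM] tensors.
    step_inputs = tf.unstack(embedded, axis=1)
    outputs, state_fw, state_bw = tf.nn.static_bidirectional_rnn(
        cell_fw, cell_bw, step_inputs, dtype=tf.float32)

    # Equation (5): use the hidden output of the last time step as the semantic vector C.
    semantic_vector_C = outputs[-1]                         # [batch, 2 * HIDDEN]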
Attack detection algorithm based on measuring the loss of the model. The last section introduced the basic framework of Seq2Seq. Seq2Seq needs modifying before it is applied to detect attacks. In Figure 5, we take training samples as the input and the output of the model at the same time, and this model is also called an autoencoder. This model is almost the same as the Seq2Seq model diagram in Figure 3. The main difference is that the output layer also uses the same data as the input layer. It should be noted that in the decoding stage, the first input of the sequence is replaced by ‘‘<GO>,’’ and the last output of the sequence is replaced by ‘‘<EOS>.’’
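A minimal sketch of this input/output arrangement for the autoencoder follows; the token ids and the helper name are illustrative assumptions, not the authors' code.

    # Sketch: build the decoder input and target for one training request so
    # that the autoencoder reproduces its own input. Token ids are illustrative.
    GO, EOS = 1, 2                      # ids of "<GO>" and "<EOS>" in the vocabulary

    def make_decoder_pair(sequence):
        """sequence: character ids of one benign HTTP request."""
        decoder_input = [GO] + sequence          # first decoder input is "<GO>"
        decoder_target = sequence + [EOS]        # last expected output is "<EOS>"
        return decoder_input, decoder_target

    # The encoder input, decoder input, and decoder target all come from the
    # same benign sample, which is what makes the model an autoencoder.
    enc_in = [17, 5, 9, 23]
    dec_in, dec_target = make_decoder_pair(enc_in)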
To train only with positive (benign) samples and perform attack detection, this article designs an attack detection algorithm based on measuring the loss of the model. The procedures are as follows:

3. Calculate the mean and standard deviation of the total loss in equation (2), and calculate the threshold using the following formula

threshold = mean(total_loss) + C * std(total_loss)    (10)

In the above formula, mean refers to calculating the mean value, and std refers to calculating the standard deviation. C is a constant, and we need to adjust it in experiments so that the threshold can gradually approach the optimal threshold. Generally speaking, C should ensure that the threshold value is greater than the maximum loss value (Loss.Max) of the test set.

4. The model predicts benign samples and malicious samples at the same time. If Loss > threshold, the sample is judged malicious; otherwise (Loss < threshold), the sequence is judged to be a normal sample.
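As a rough illustration of this loss-based decision rule (equation (10) and step 4), here is a minimal numpy sketch, assuming the per-sample reconstruction losses of the trained autoencoder are already available; the values are illustrative.

    import numpy as np

    def loss_threshold(benign_losses, C):
        """Equation (10): threshold = mean(total_loss) + C * std(total_loss)."""
        return np.mean(benign_losses) + C * np.std(benign_losses)

    def classify(loss, threshold):
        """Step 4: a request whose reconstruction loss exceeds the threshold is malicious."""
        return "malicious" if loss > threshold else "benign"

    # Illustrative values only: losses of benign validation requests.
    benign_losses = np.array([0.10, 0.12, 0.08, 0.11, 0.09])
    thr = loss_threshold(benign_losses, C=3.0)
    print(classify(0.55, thr))   # -> malicious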
Attack payload visualization module based on attention mechanism

Seq2Seq model with attention mechanism. To solve the problem that the results of the conventional detection model cannot be explained, this section optimizes the Seq2Seq model with the attention mechanism and uses the characteristics of this mechanism to mark the specific location of the attack payload, so as to realize the visualization function of the attack payload. The optimized model is shown in Figure 6.

1. Encoder

After introducing the attention mechanism, the semantic vector C is obtained as a weighted average of the output H of the encoder's hidden layer, as follows

C_i = \sum_{j=1}^{T} a_{ij} H_j    (11)

a_{ij} represents the corresponding weight of each hidden layer and is calculated by the following formula

a_{ij} = softmax(e_{ij})    (12)

e_{ij} is a score calculated from the output H_i of the encoder's hidden layer and the output H'_j of the decoder's hidden layer. The score is calculated by the following formula

e_{ij} = score(H_i, H'_j)    (13)

For the score function, Luong et al.24 define the following three forms, which can be selected according to the problem

score(H_i, H'_j) = H_i^T H'_j,  or  H_i^T W_a H'_j,  or  V_a tanh(W_a [H_i; H'_j])    (14)

2. Decoder

The decoding stage is determined by the current semantic vector C_t and the output H'_t of the decoder's hidden layer. First, we concatenate the two vectors and use tanh as the activation function

H_t = tanh(W_c [C_t; H'_t])    (15)

Finally, the predicted output Y_t is calculated

Y_t = argmax P(Y_t) = softmax(W_c H_t)    (16)

We should note that the output Y_t at this time is a probabilistic sequence. By looking up the maximum probability value in the sequence, the corresponding word is retrieved from the vocabulary list and decoded.
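The following numpy sketch illustrates equations (11) to (16) with the dot-product form of the Luong score;24 the shapes, weights, and values are illustrative assumptions only.

    import numpy as np

    def softmax(x):
        e = np.exp(x - np.max(x))
        return e / e.sum()

    # Encoder hidden outputs H_1..H_T and one decoder hidden output H'_j.
    H = np.random.rand(6, 8)            # T = 6 time steps, hidden size 8
    H_dec = np.random.rand(8)

    # Equation (13) with the dot-product score: e_ij = H_i^T H'_j.
    e = H @ H_dec                       # shape [T]

    # Equation (12): attention weights a_ij = softmax(e_ij).
    a = softmax(e)

    # Equation (11): context (semantic) vector C_i as the weighted average of H.
    C = (a[:, None] * H).sum(axis=0)    # shape [8]

    # Equation (15): combine context and decoder state through tanh.
    Wc = np.random.rand(16, 8)          # maps [C_t; H'_t] (16) to H_t (8)
    H_att = np.tanh(np.concatenate([C, H_dec]) @ Wc)

    # Equation (16): project to the vocabulary and take the most probable word.
    W_out = np.random.rand(8, 30)       # vocabulary size 30
    Y = softmax(H_att @ W_out)
    predicted_id = int(np.argmax(Y))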
[Figure 7. The softmax output layer produces a probabilistic sequence for each output Y_1, Y_2, ..., Y_t.]
Attack payload labeling principle based on attention mechanism. Formula (15) shows that the output H_t of the attention layer is determined by the semantic vector C_t and by H'_t, the output of the decoder's hidden layer. At this time, the semantic vector C_t represents the weight of the current input with respect to the output of the model, which is similar to the human attention mechanism: the focus of visual attention has a large weight, and, on the contrary, low-value information has a low weight.

Assume that the input X_t is the No.i element of the vocabulary at a certain time. The output Y_t at the current time is then a probabilistic sequence that is as long as the vocabulary and whose elements sum to 1, as shown in Figure 7 and Formula (16). On the premise of correct prediction, the No.i element of the sequence should be the maximum value of the sequence and far greater than the values of the other elements. (Figure 7 is a simple demonstration: the No.i element of the probabilistic sequence is set to 1 and the rest are set to 0.) Based on this conclusion, the following steps can be taken to optimize the model:

1. The test samples are predicted by the trained model to obtain the output probabilistic sequence
Y_ij = [a_i1, a_i2, a_i3, ..., a_it]    (17)

In the formula, Y_ij refers to the No.j element (over the vocabulary) of the No.i sequence, T is the length of the vocabulary, and a_ij records the current value.

2. Count all outputs of the samples and collect them as alpha

alpha = [a_ij]    (18)

Calculate the mean value and standard deviation of alpha, and use the following formula to calculate the threshold. In the formula, C is a constant to be determined, the mean value is calculated by mean, and the standard deviation is calculated by std

threshold = mean(alpha) - C * std(alpha)    (19)

3. By adjusting the constant C, make sure the threshold value is less than the minimum weight of the benign samples in the test set and greater than the maximum weight of the malicious samples, as in Formula (20). Meanwhile, it is necessary to observe whether the sample labeling conforms to the objective facts. If it does, the threshold value is selected; otherwise, the adjustment continues

malicious_max < threshold ≤ benign_min    (20)

4. If a sequence in the test set is checked by the model and the model predicts an element a_ij < threshold in the probabilistic sequence of Y_ij, it indicates that Y_ij is abnormal and it is labeled red; otherwise, if a_ij > threshold, it indicates that Y_ij is normal and it is labeled white.

Experiment and assessment

Data set

This article applies the HTTP DATASET CSIC 2010 data set to do the experiments and analysis.25 After processing, we stored 20,331 benign samples and 16,243 malicious samples. According to the detection method introduced in this article, we divide the data sets into three parts: training samples, testing samples, and detection samples. Among them, we take 30% of the benign samples randomly to do the threshold test training, and 1001 benign samples and 1001 malicious samples randomly to do the abnormal element threshold test. Apart from that, this model adopts only benign samples in training, but in the comparative experiments, benign samples and malicious samples are adopted simultaneously. The specific allocation of the data sets is shown in Table 1.

Environment for experiment

The model introduced in this article is mainly developed under the Windows system. The code involved in the experiment is mainly based on Python's tensorflow framework.26 The function static_bidirectional_rnn in tensorflow is applied to realize the encoder of the Bi-LSTM algorithm. The function Seq2Seq in tensorflow is applied to realize the encoder with the attention mechanism. The Python Scikit-learn toolkit helps to realize the assessment of the model.27 Detailed configurations are shown in Table 2.

Experiment process

Classification threshold parameter optimization. In sections ‘‘Attack detection module based on Seq2Seq’’ and ‘‘Attack payload visualization module based on attention mechanism,’’ the calculation methods of the threshold value for model classification and of the threshold value for exceptional determination have been introduced in detail, but the formulas cannot directly give the final threshold value. Further experimental tests are necessary to get the optimal threshold value. Formula (10) shows that the constant C needs to be adjusted to obtain a reasonable threshold value to achieve the goal of sample classification. We tested the accuracy change with the constant C from 1 to 7 in steps of 2, and specifically tested the accuracy with the constant 0. The relationship between the threshold value and the accuracy is shown in Table 3.
Table 2. Hardware and software configuration of the experimental environment.

Operating system: Microsoft Windows 10 Build 17134, Professional edition, 64-bit operating system
System configuration: CPU: Intel i7-7700; Memory: 8 GB; Hard disk: 1 TB; GPU: NVIDIA GeForce GTX 1060, Display memory: 6 GB
The Python standard library and version: Python 3.6.2; tensorflow-gpu == 1.12.0; numpy == 1.16.0; scikit-learn == 0.19.2; matplotlib == 2.2.2; colorama == 0.4.1

Table 3. Threshold test results.

Number  Constant C  Threshold  Accuracy (%)
1       0           0.076447   72.57
2       1           0.215574   94.27
3       3           0.493829   99.20
4       5           0.772084   99.54
5       7           1.050339   99.58

It is understandable that the higher the threshold is, the higher the accuracy on benign samples is; but the model also needs to detect malicious samples, so while ensuring the accuracy, the smaller the threshold is, the more consistent it is with the classification standard of the model. As shown in the table above, when the constant C is 5 and 7, the accuracy no longer increases significantly, and the threshold value is 0.772084.

Threshold parameter optimization of abnormal elements. Because we cannot quantify the accuracy of the abnormal-element classification threshold directly, we only calculate an estimated value by Formula (19): threshold = mean(alpha) - C * std(alpha). If the value meets the expectation, it is adopted; conversely, it is adjusted. According to the statistics of 1001 benign samples and 1001 malicious samples, mean(alpha) = 0.67052 and std(alpha) = 0.4568767 were obtained. The threshold adjustment calculations are as follows:

1. Set the initial value C = 0, the step size to 0.1, and the maximum value to 1.5;
2. Calculate the threshold value by Formula (19);
3. Print ten malicious samples and ten normal samples randomly to observe whether the attack payloads are labeled correctly;
4. If the result does not meet the expectation, repeat (1) to (3) until it does, and store the current threshold value.

After many rounds of experiments, we set the threshold value to 0.076589 (constant C = 1.3), at which the output meets the target of labeling the attack payload. Figure 8 shows an example of labeling attack payloads.
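As a sketch of how such labeling can be rendered, the snippet below applies the abnormal-element rule of equations (17) to (20) and prints suspect characters in red with colorama (which appears in the experimental environment). The request, weights, and constant are illustrative and are not taken from Figure 8.

    import numpy as np
    from colorama import Fore, Style, init

    init(autoreset=True)

    def element_threshold(alpha, C):
        """Equation (19): threshold = mean(alpha) - C * std(alpha)."""
        return np.mean(alpha) - C * np.std(alpha)

    def label_request(request, char_weights, threshold):
        """Step 4: characters whose weight falls below the threshold are printed red."""
        out = []
        for ch, a in zip(request, char_weights):
            if a < threshold:
                out.append(Fore.RED + ch + Style.RESET_ALL)   # abnormal element
            else:
                out.append(ch)                                # normal element
        return "".join(out)

    # Illustrative weights: low values around the injected payload "' or 1=1".
    request = "GET /item.php?id=1' or 1=1"
    weights = np.array([0.9] * 18 + [0.05] * 8)
    thr = element_threshold(weights, C=1.3)
    print(label_request(request, weights, thr))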
Experiment indicators

To better evaluate the attack detection model based on the attention mechanism and the Bi-LSTM algorithm, the experimental results are evaluated using the confusion matrix. The confusion matrix, also known as the error matrix, can be used to visually evaluate the performance of classification model algorithms, as shown in Table 4.
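As a minimal sketch of this evaluation, assuming binary labels in which 1 denotes a malicious request, the indicators can be computed with scikit-learn (listed in the experimental environment); the labels below are illustrative.

    from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

    # Illustrative labels only: 1 = malicious request, 0 = benign request.
    y_true = [1, 1, 1, 0, 0, 0, 1, 0]
    y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print("precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
    print("recall:   ", recall_score(y_true, y_pred))      # TP / (TP + FN)
    print("F1:       ", f1_score(y_true, y_pred))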
Table 6. Model performance indicators.

Detection models    Precision (%)  Recall (%)  F1 (%)
SVM                 92.07          94.95       93.49
TF-IDF_RF           93.81          89.12       91.41
Word2vec_MLP        96.28          96.29       96.29
Character_CNN       99.48          94.66       97.01
Attention_Bi-LSTM   97.02          97.60       97.31

Conclusion

Based on the experimental data sets, tests on Attention_Bi-LSTM, SVM, TF-IDF_RF, Word2vec_MLP, and Character_CNN are carried out. The results indicate that SVM and TF-IDF_RF have relatively low detection rates: their precision is 92.07% and 93.81%, respectively, and their recall is 94.95% and 89.12%, respectively. The precision and recall of Word2vec_MLP are average, 96.28% and 96.29%, respectively. This means that extracting word vectors by Word2vec can maintain the samples' semantics and support classification as well. The precision of Character_CNN reaches 99.48%, but the recall is 94.66%, which shows that Character_CNN has a relatively high missed-detection rate. The precision, recall, and F1 values of Attention_Bi-LSTM are as high as 97.02%, 97.60%, and 97.31%, respectively. Also, Attention_Bi-LSTM has the largest AUC (the area under the ROC curve). This shows that, on the premise of benign training samples alone, this model can detect attack requests effectively and has rather high precision and recall. Besides, its exclusive function of labeling the attack payload helps to achieve attack visualization.

However, the model has some shortcomings. The model constructs sequence vectors by using character embedding. Although this shortcuts the steps of manual word segmentation and feature extraction, it enlarges the amount of calculation: there are only around 20,000 samples in the data sets, but the training time is more than 10 h. We will consider adopting an N-gram embedding method in further experiments or improving the hardware resources of the experiments.

Acknowledgements

The authors thank anonymous reviewers and editors for providing helpful comments on earlier drafts of the manuscript.

References

1. Gu Y, Wang Y, Liu Z, et al. Sleepguardian: an RF-based healthcare system guarding your sleep from afar, 2019, arXiv:1908.06171v1.
2. Gu Y, Zhang Y, Li J, et al. Sleepy: wireless channel data driven sleep monitoring via commodity WiFi devices. IEEE T Big Data. Epub ahead of print 28 June 2018. DOI: 10.1109/TBDATA.2018.2851201.
3. Gu Y, Ren F and Li J. Paws: passive human activity recognition based on wifi ambient signals. IEEE Internet Thing J 2015; 3(5): 796–805.
4. Gu Y, Zhang X, Li C, et al. Your WiFi knows how you behave: leveraging WiFi channel data for behavior analysis. In: Proceedings of the 2018 IEEE global communications conference (GLOBECOM), Abu Dhabi, UAE, 9–13 December 2018, pp.1–6. New York: IEEE.
5. Liu X, Dong M, Ota K, et al. Trace malicious source to guarantee cyber security for mass monitor critical infrastructure. J Comput Syst Sci 2018; 98: 1–26.
6. Wu J, Ota K, Dong M, et al. Big data analysis-based security situational awareness for smart grid. IEEE T Big Data 2016; 4(3): 408–417.
7. Adeva JJG and Atxa JMP. Intrusion detection in web applications using text mining. Eng Appl Artif Intell 2007; 20(4): 555–566.
8. Tang B, He H, Baggenstoss PM, et al. A bayesian classification approach using class-specific features for text categorization. IEEE Trans Knowl Data Eng 2016; 28(6): 1602–1606.
9. Li D, Zhang B and Li C. A feature-scaling-based k-nearest neighbor algorithm for indoor positioning systems. IEEE Internet Thing J 2015; 3(4): 590–597.
10. Ding Q, Zhang J, Wang J, et al. Based on knn and rocchio improved text classification technology. Autom Instrum 2017; 8: 41.
11. Chandrasekhar AM and Raghuveer K. Intrusion detection technique by using k-means, fuzzy neural network and SVM classifiers. In: Proceedings of the international conference on computer communication and informatics, Coimbatore, India, 4–6 January 2013, pp.1–7. New York: IEEE.
12. Suykens JAK and Vandewalle J. Least squares support vector machine classifiers. Neural Proces Lett 1999; 9(3): 293–300.
13. Lin CT, George Lee CS, Lin CT, et al. Neural fuzzy systems: a neuro-fuzzy synergism to intelligent systems, vol. 205. Upper Saddle River, NJ: Prentice Hall, 1996.
14. Krishna K and Murty NM. Genetic k-means algorithm. IEEE T Syst Man Cyb: Part B 1999; 29(3): 433–439.
15. Rathore S, Sharma PK and Park JH. Xssclassifier: an efficient XSS attack detection approach based on machine learning classifier on SNSs. J Inform Process Syst 2017; 13(4): 1014–1028.
16. Breiman L. Random forests. Machine Learn 2001; 45(1): 5–32.
17. Freund Y and Mason L. The alternating decision tree learning algorithm. In: Proceedings of the sixteenth international conference on machine learning, ICML '99, Bled, Slovenia, 27–30 June 1999, pp.124–133. New York: ACM.
18. Castilla E, Martín N and Pardo L. A logistic regression analysis approach for sample survey data based on Phi-divergence measures. In: Gil E, Gil E, Gil J, et al. (eds) Mathematics of the uncertain. New York: Springer, 2018, pp.465–474.
19. Yang X, Wei LI, Sun M, et al. Web attack detection method on the basis of text clustering. CAAI Trans Intel Syst 2014; 9: 40–46.
20. Zhang H, Guan H, Yan H, et al. Webshell traffic detection with character-level features based on deep learning. IEEE Access 2018; 6: 75268–75277.
21. Yandong LI, Hao Z and Lei H. Survey of convolutional neural network. J Comput Appl 2016; 36: 2508–2515.
22. Greff K, Srivastava RK, Koutník J, et al. LSTM: a search space odyssey. IEEE T Neural Netw Learn Syst 2016; 28(10): 2222–2232.
23. Google. seq2seq, 2017, https://google.github.io/seq2seq/
24. Luong MT, Pham H and Manning CD. Effective approaches to attention-based neural machine translation, 2015, arXiv:1508.04025v5.
25. Giménez CT, Villegas AP and Marañón GA. Http data set CSIC 2010. Information Security Institute of CSIC (Spanish Research National Council), 2010, https://www.impactcybertrust.org/dataset_view?idDataset=940
26. Abadi M, Barham P, Chen J, et al. Tensorflow: a system for large-scale machine learning. In: Proceedings of the 12th USENIX symposium on operating systems design and implementation (OSDI 16), Savannah, GA, 2–4 November 2016, pp.265–283. New York: ACM.
27. Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: machine learning in python. J Machine Learn Res 2011; 12: 2825–2830.