TeamDL at SemEval-2018 Task 8: Cybersecurity Text Analysis using Convolutional Neural Network and Conditional Random Fields

Manikandan R 1,*, Krishna Madgula 2, Snehanshu Saha 1,2
1 CAMMS, Dept of CSE, PESIT-Bangalore South Campus
2 PESIT-Bangalore South Campus
[email protected]
[email protected]
[email protected]

* Work performed during weekend part-time assistantship at CAMMS.
Abstract

In this paper we present our participation in SemEval-2018 Task 8, subtasks 1 and 2. We developed a Convolutional Neural Network system for malware sentence classification (subtask 1) and a Conditional Random Fields system for malware token label prediction (subtask 2). We experimented with a couple of word embedding strategies and feature sets, and achieved competitive performance across the two subtasks. Code is made available at https://bitbucket.org/vishnumani2009/securenlp

1 Introduction

Cybersecurity risks and malware threats are becoming common and increasingly dangerous, requiring analysis of large repositories of malware-related information in real time to understand their capabilities and mount an effective defense. The sheer volume of data and its potential applications alone have increased traction among NLP researchers in recent times. In this line, SemEval-2018 Task 8 offers 4 subtasks addressing text classification and token, relation and attribute label prediction in the cybersecurity domain using MalwareTextDB (Lim et al., 2017). While subtask 1 focuses on predicting the relevance of sentences to malware, subtasks 2, 3 and 4 focus on predicting token, relation and attribute labels for the malware text from subtask 1. More details about each of the subtasks can be found in Lim et al. (2017).

Concerning subtask 1, which was inherently formulated as a text classification problem, very few works have been done to date in the cybersecurity domain (Lim et al., 2017; Zhang et al., 2016). In the general domain, however, the problem of text classification is well addressed, with extensive usage of deep learning approaches (Zhou et al., 2016; Liang and Zhang, 2016; Kim, 2014; Kalchbrenner et al., 2014; Zhang et al., 2015), support vector machines, logistic regression (Genkin et al., 2007; Jiang et al., 2016) and tree-based approaches (Bouaziz et al., 2014). On the other hand, subtask 2 was formulated as a sequence tagging problem, which has to date been addressed by CRFs (Finkel et al., 2005; R. et al., 2016, 2017), deep learning approaches (Chiu and Nichols, 2016; Ma and Hovy, 2016; Lample et al., 2016) and SVMs (Ekbal and Bandyopadhyay, 2012).

In this paper, we describe our system that addresses subtasks 1 and 2, involving malware sentence classification and malware token label prediction. We designed these systems by adapting various insights from previous works on text classification and sequence tagging. We submitted a Convolutional Neural Network (CNN) based system for subtask 1 and a Conditional Random Field (CRF) based system for subtask 2.

The rest of the paper is organized as follows. In Section 2, we discuss the dataset and preprocessing. In Section 3, we describe the algorithms and features used in the process of model development. In Section 4, we describe our results and some of our findings. Finally, in Section 5, we conclude with a summary and possible implications for future work.

2 Dataset and Preprocessing

The MalwareTextDB corpus used for this work consists of APT reports describing malware-related information taken from APTnotes (https://github.com/aptnotes/). We designed an end-to-end pipeline consisting of three modules which process input text across multiple stages. In stage 1, the input sentence is fed to a preprocessing module which pre-processes the …
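As an illustration of what such a stage-1 preprocessing module might look like, the following is a minimal sketch assuming NLTK tokenization (NLTK is cited in the references); the PATH/EXE placeholder normalization and the function name are assumptions for illustration, not the exact steps of the submitted pipeline.

# Hypothetical sketch of a stage-1 preprocessing module for APT report
# sentences; the exact cleaning steps of the system are not specified here.
import re
import nltk

nltk.download("punkt", quiet=True)  # tokenizer models used by word_tokenize

def preprocess(sentence):
    """Lowercase, normalize file-system artifacts, and tokenize a sentence."""
    sentence = sentence.lower()
    # Collapse Windows-style paths and executable names into placeholder
    # tokens (Section 4 notes PATH / EXE patterns as an error source).
    sentence = re.sub(r"[a-z]:\\\S+", " PATH ", sentence)
    sentence = re.sub(r"\S+\.exe\b", " EXE ", sentence)
    return nltk.word_tokenize(sentence)

print(preprocess(r"The dropper writes C:\temp\payload.exe to disk."))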
         CRF-Strict           CRF-Relaxed
         P     R     F        P     R     F
test17   0.51  0.26  0.34     0.45  0.36  0.40
dev18    0.18  0.25  0.21     0.38  0.22  0.29
test18   0.29  0.23  0.25     0.42  0.30  0.36

Table 5: Results of subtask 2 on Conditional Random Fields (P: precision, R: recall, F: F1) under strict and relaxed evaluation.
GloVe embeddings consistently outperformed Word2Vec embeddings. This is in line with the work of Kim (2014). We initially hypothesized that, since "the context of the malware texts are different from normal English texts", task-specific embeddings would improve the results of subtask 1. However, we observed that task-specific embeddings produced lower results compared to native embeddings. Examination of the results revealed a high number of false-negative predictions, i.e., relevant sentences predicted as non-malware texts; we believe this may be attributed to the limited dataset used for developing the task-specific embeddings, unlike the native embeddings, which were created from a very large corpus. This result also agrees with the general observation that the size of the training corpus often has a greater impact on results than its strict match with the target domain (Tourille et al., 2017).
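To make the two embedding strategies concrete, the sketch below contrasts task-specific embeddings trained with gensim's Word2Vec on a toy stand-in for the malware corpus against native pretrained GloVe vectors loaded from a text file; the file name, dimensions and hyperparameters are illustrative assumptions, not the exact settings used.

# Hedged sketch: native pretrained embeddings vs. task-specific ones.
# File path, dimensions and hyperparameters are illustrative assumptions.
import numpy as np
from gensim.models import Word2Vec

# Task-specific embeddings: trained only on the (small) malware corpus.
corpus = [["the", "dropper", "contacts", "its", "c2", "server"],
          ["the", "malware", "encrypts", "files"]]  # toy stand-in
task_model = Word2Vec(sentences=corpus, vector_size=100, window=5,
                      min_count=1, epochs=20)

# Native embeddings: load pretrained GloVe vectors from a text file
# (e.g. glove.6B.100d.txt), which were trained on a very large corpus.
def load_glove(path):
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return vectors

# glove_vectors = load_glove("glove.6B.100d.txt")  # assumed local file
print(task_model.wv["malware"][:5])  # task-specific vector for "malware"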
For subtask 1, we achieved an accuracy of 0.50 and were 7% behind the top-performing systems. We identified three different sources of errors across the sentences, in line with previous works (Lim et al., 2017), namely misclassification of i) sentences consisting of malware-related keywords without implication on actions; ii) sentences describing attacker actions; and additionally we also found iii) misclassification of sentences containing specific patterns such as the presence of PATH and EXE. Further, we had initially hoped that the multichannel architecture would prevent overfitting (Kim, 2014) and thus work better than the single-channel model, especially on small datasets like MalwareTextDB. The results, however, showed the opposite, and hence further work on regularizing the training process and a simpler single-channel architecture is warranted.
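For reference, below is a minimal Keras sketch of the Kim (2014) style multichannel idea discussed here: one frozen ("static") embedding channel and one trainable ("non-static") channel feed parallel convolutions whose outputs are merged. The vocabulary size, dimensions and filter settings are illustrative assumptions, and unlike Kim's original, each channel here has its own convolution filters for simplicity.

# Hedged sketch of a Kim (2014)-style multichannel CNN for sentence
# classification; sizes and filter settings are illustrative assumptions.
from tensorflow.keras import layers, models

VOCAB, DIM, MAXLEN = 20000, 100, 50  # assumed sizes

inp = layers.Input(shape=(MAXLEN,))
# Channel 1: static (frozen) embeddings; in practice, pretrained weights
# would be loaded here instead of the random initialization.
static = layers.Embedding(VOCAB, DIM, trainable=False)(inp)
# Channel 2: non-static embeddings, fine-tuned during training.
nonstatic = layers.Embedding(VOCAB, DIM, trainable=True)(inp)

convs = []
for channel in (static, nonstatic):
    c = layers.Conv1D(filters=100, kernel_size=3, activation="relu")(channel)
    convs.append(layers.GlobalMaxPooling1D()(c))

merged = layers.concatenate(convs)
merged = layers.Dropout(0.5)(merged)  # regularization (Srivastava et al., 2014)
out = layers.Dense(1, activation="sigmoid")(merged)  # malware vs. non-malware

model = models.Model(inp, out)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()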
For subtask 2, during analysis we found that there were multiple previously unseen malware names and felt that orthographic features alone would be insufficient. Hence, in addition to the commonly used features, we also included gazette features with words that qualify a malware entity. However, during evaluation on the development set we found a high drop in precision when we used the gazette features, owing to their deterministic nature. Hence, we submitted the CRF with only the common features described in Section 3.2.1 for the final evaluation. With this system we achieved results of 0.25 and 0.36 in strict and relaxed evaluation respectively (Table 5). Our accuracy is 3.5% (avg.) behind the top-performing system across the evaluations. We identified the following sources of errors: i) tagging of tokens in sentences containing only actions but not entities; these are sentences with only attacker actions, in line with the error from subtask 1; ii) lack of sensitivity to context, where some tokens in the test documents are given the same label seen in training irrespective of context; and iii) mis-tagging of some tokens with common suffixes. For subtask 2 we experimented with a simple CRF architecture and basic features, hence we believe further exploration of feature engineering is needed to reduce the context-related errors. As far as addressing the rest of the errors is concerned, we plan to explore a combination of rule-based and deep learning approaches.
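To make the feature setup concrete, the sketch below shows common orthographic token features together with an optional gazette lookup of the kind described above, using sklearn-crfsuite; the feature names, the toy gazette and the training call are illustrative assumptions, not the submitted system's exact configuration.

# Hedged sketch: CRF token features with an optional gazette lookup.
# Feature names and the toy gazette are illustrative assumptions.
import sklearn_crfsuite

GAZETTE = {"stuxnet", "duqu", "flame"}  # hypothetical malware gazette

def token_features(tokens, i, use_gazette=False):
    w = tokens[i]
    feats = {
        "word.lower": w.lower(),
        "word.isupper": w.isupper(),
        "word.istitle": w.istitle(),
        "word.isdigit": w.isdigit(),
        "suffix3": w[-3:],  # common-suffix errors noted above
        "prev": tokens[i - 1].lower() if i > 0 else "BOS",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "EOS",
    }
    if use_gazette:
        # Deterministic lookup; as reported above, this hurt dev precision.
        feats["in.gazette"] = w.lower() in GAZETTE
    return feats

def sent2features(tokens, use_gazette=False):
    return [token_features(tokens, i, use_gazette)
            for i in range(len(tokens))]

# Toy training example with BIO-style labels.
X = [sent2features(["Stuxnet", "infects", "PLCs"])]
y = [["B-Entity", "B-Action", "O"]]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
print(crf.predict(X))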
5 Conclusion

In this work, we developed CNN and CRF systems for malware text classification and token label prediction, achieving competitive results. For subtask 1, we experimented with a couple of word embedding strategies and found native GloVe embeddings to be useful. For subtask 2, we used a CRF with simple features, achieving results close to the top-performing system and above the official benchmark. Further, we described the various sources of errors identified in the course of the analysis. In future, we plan to further improve our system based on the above observations.

Acknowledgments

We thank the task organizers for providing access to the MalwareTextDB corpus and for organizing the shared task. Further, we would like to thank the various authors for open-sourcing the code of the algorithms used in this work.

References

Ameni Bouaziz, Christel Dartigues-Pallez, Célia da Costa Pereira, Frédéric Precioso, and Patrick Lloret. 2014. Short text classification using semantic random forest. In DaWaK.
Jason P. C. Chiu and Eric Nichols. 2016. Named entity recognition with bidirectional LSTM-CNNs. TACL, 4:357–370.

François Chollet et al. 2015. Keras. https://github.com/fchollet/keras.

Asif Ekbal and Sivaji Bandyopadhyay. 2012. Named entity recognition using support vector machine: A language independent approach.

Jenny Rose Finkel, Trond Grenager, and Christopher D. Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In ACL.

Alexander Genkin, David D. Lewis, and David Madigan. 2007. Large-scale Bayesian logistic regression for text categorization. Technometrics, 49:291–304.

Mingyang Jiang, Yanchun Liang, Xiaoyue Feng, Xiaojing Fan, Zhili Pei, Yu Xue, and Renchu Guan. 2016. Text classification based on deep belief network and softmax regression. Neural Computing and Applications, pages 1–10.

Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. In ACL.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In EMNLP.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In HLT-NAACL.

Quoc V. Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In ICML.

Depeng Liang and Yongdong Zhang. 2016. AC-BLSTM: Asymmetric convolutional bidirectional LSTM networks for text classification. CoRR, abs/1611.01884.

Swee Kiat Lim, Aldrian Obaja Muis, Wei Lu, and Ong Chen Hui. 2017. MalwareTextDB: A database for annotated malware articles. In ACL.

Edward Loper and Steven Bird. 2002. NLTK: The Natural Language Toolkit. CoRR, cs.CL/0205028.

Xuezhe Ma and Eduard H. Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. CoRR, abs/1603.01354.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In ACL.

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jacob VanderPlas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Edouard Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP.

Sarath P. R., Manikandan R, and Yoshiki Niwa. 2016. Hitachi at SemEval-2016 Task 12: A hybrid approach for temporal information extraction from clinical notes. In SemEval@NAACL-HLT.

Sarath P. R., Manikandan R, and Yoshiki Niwa. 2017. Hitachi at SemEval-2017 Task 12: System for temporal information extraction from clinical notes. In SemEval@ACL.

Radim Řehůřek and Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta. ELRA. http://is.muni.cz/publication/884893/en.

Frank Seide and Amit Agarwal. 2016. CNTK: Microsoft's open-source deep-learning toolkit. In KDD.

Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958.

Julien Tourille, Olivier Ferret, Xavier Tannier, and Aurélie Névéol. 2017. LIMSI-COT at SemEval-2017 Task 12: Neural architecture for temporal information extraction from clinical narratives. In SemEval@ACL.

Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. 2017. Comparative study of CNN and RNN for natural language processing. CoRR, abs/1702.01923.

Xiang Zhang, Junbo Jake Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In NIPS.

Ye Zhang and Byron C. Wallace. 2017. A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification. In IJCNLP.

Yunan Zhang, Qingjia Huang, Xinjian Ma, Zeming Yang, and Jianguo Jiang. 2016. Using multi-features and ensemble learning method for imbalanced malware classification. 2016 IEEE Trustcom/BigDataSE/ISPA, pages 965–973.
Peng Zhou, Zhenyu Qi, Suncong Zheng, Jiaming Xu, Hongyun Bao, and Bo Xu. 2016. Text classification improved by integrating bidirectional LSTM with two-dimensional max pooling. In COLING.

Zhi-Hua Zhou and Xu-Ying Liu. 2006. Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Transactions on Knowledge and Data Engineering, 18:63–77.