
Journal of Information Processing Vol.27 555–563 (Sep. 2019)
[DOI: 10.2197/ipsjjip.27.555]

Regular Paper

Detecting Unseen Malicious VBA Macros with NLP Techniques

Mamoru Mimura1,a)  Hiroya Miura1

Received: December 10, 2018, Accepted: March 5, 2019

Abstract: In recent years, the number of targeted email attacks which use Microsoft (MS) document files has been increasing. In particular, malicious VBA (Visual Basic for Applications) macros are often contained in the MS document files. Some researchers have proposed methods to detect malicious MS document files. However, few methods analyze the malicious macros themselves. This paper proposes a method to detect unseen malicious macros with the words extracted from the source code. Malicious macros tend to contain typical functions to download or execute the main body, and obfuscated strings such as encoded or divided characters. Our method represents feature vectors from the corpus with several NLP (Natural Language Processing) techniques. Our method then trains basic classifiers with the extracted feature vectors and labels, and the trained classifiers predict the labels of unseen macros. Experimental results show that our method can detect 89% of new malware families. The best F-measure achieves 0.93.

Keywords: VBA macro, machine learning, natural language processing technique, bag-of-words, Doc2vec, TFIDF

1. Introduction

In recent years, email has become one of the most popular communication tools. This situation has led to targeted email attacks becoming a big threat to society. A targeted email attack is a specific attack in which the attacker attempts to persuade a victim to run a specific action. Depending on the specific action, there are two types of targeted email attacks. One is to open a malicious link and download a malicious program, and the other is to open a malicious attachment. Attackers attempt to earn credibility with their victims through an eloquent mail text. Moreover, the attackers convince victims to unknowingly download a malicious file or click through to a malicious site. According to a report published by Sophos [1], malicious attachment files are used in most targeted email attacks. The report shows that 85% of the attached files are Microsoft Office (MS) document files. Furthermore, most of the MS document files contain malicious VBA (Visual Basic for Applications) macros. VBA is an implementation of Microsoft's programming language Visual Basic 6, and is built into most Microsoft Office applications. Malicious macros have a long history. For example, the LOVELETTER worm, one of the most famous malicious macros, infected more than 45 million computers, and some organizations suffered serious damage in 2000. Subsequently, malicious macros gradually faded out. They grew in popularity again with the rise of targeted email attacks. Macros are a powerful tool to automate common tasks in MS document files. However, malicious macros use this functionality to infect the computer. In targeted attacks, attackers often use unseen malicious macros which are not detected by anti-virus programs with the latest definition. In general, anti-virus programs require virus pattern files, and the pattern files have to be updated. Most attackers, however, obfuscate programs to evade detection. Therefore, it is difficult to detect unseen malicious macros that contain new malware families.

To detect unseen malicious macros, some previous methods can be applied. For instance, Nissim et al. [2] analyzed the structure of docx files, and proposed a method to detect malicious docx files. These previous methods detect malicious MS document files. These methods, however, do not discriminate between malicious macros and benign macros. Hence, if an MS document file contains benign macros, these methods might detect the benign macros as malicious ones. If the malicious MS document file is camouflaged with the structure of a benign MS document file, the attacker can probably evade the detection method. Detecting malicious macros themselves can overcome these weaknesses. However, there are few methods to analyze malicious macros themselves [3], [4]. Hence, there is room for improvement on these methods.

This paper proposes a method to detect unseen malicious macros themselves. Malicious VBA macros tend to contain typical functions to download or execute the main body, and obfuscated strings such as encoded or divided characters. To investigate the source code, we focus on NLP (Natural Language Processing) techniques. NLP techniques are usually used to analyze natural languages, as the name indicates. In this paper, we presume VBA macros are written in a natural language, and attempt to learn the difference between benign and malicious VBA macros with Doc2vec. Doc2vec is an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of text. Then we input the extracted feature vectors with the labels into supervised learning models to classify benign and malicious VBA macros. The key idea of this research is reading VBA macros as a natural language.

1 National Defense Academy, Yokosuka, Kanagawa 239–8686, Japan
a) [email protected]
To the best of our knowledge, Doc2vec has never been applied to detecting malicious VBA macros. Doc2vec enables extracting feature vectors from VBA macros automatically. That is why we focus on Doc2vec.

Our method uses several NLP techniques to investigate the macro's source code. Our method divides the source code into words, and represents feature vectors from the corpus with these techniques. Term Frequency (TF) and Term Frequency-Inverse Document Frequency (TFIDF) are used to select important words for classification. TF is a simple method which weights a value corresponding to the frequency of words in a corpus. TFIDF is a more sophisticated method which weights representative words in a corpus. Bag-of-Words (BoW) and Doc2vec represent feature vectors from the corpora. BoW is a basic method that represents vectors corresponding to the frequency of the words. Doc2vec is a more complicated model that represents vectors from the context of the documents. Our method then uses the extracted feature vectors and labels, and trains basic classifiers: Support Vector Machine (SVM), Random Forests (RF) and Multi Layer Perceptron (MLP). Finally, the trained classifiers predict the labels of unseen macros. Experimental results show that our method can detect 89% of new malware families. The best F-measure achieves 0.93.

This paper addresses 4 research questions as follows.
( 1 ) Does our method detect unseen malware families?
( 2 ) Does Doc2vec represent feature vectors effectively?
( 3 ) What is the best combination of these NLP techniques and classifiers?
( 4 ) Does TFIDF select important words to classify macros?
In order to address these questions, we conduct several experiments. Based on the results, this paper makes the following contributions:
( 1 ) We propose a method to detect unseen malicious macros which contain new malware families [5].
( 2 ) Doc2vec is effective in classifying malicious macros [6].
( 3 ) Linear classifiers are effective for Doc2vec [6].
( 4 ) Reducing words using Term Frequency is effective for classifying macros [6].

The structure of this paper is as follows. Section 2 introduces related work and reveals the differences between this paper and other relevant studies. Section 3 describes malicious VBA macros, and Section 4 presents some NLP techniques. Section 5 proposes the method, and Section 6 conducts experiments. Section 7 discusses the results, and finally Section 8 describes the conclusion.

2. Related Work

In targeted email attacks, attackers use attachment files which contain malicious code. Methods to detect these malicious files can be categorized into static analysis and dynamic analysis. Our method does not execute the MS document files or VBA macros. Therefore, we focus on static analysis in this section. The attachment files are mainly categorized into executable files and document files. Our method investigates document files. The document files are roughly categorized into MS document files and PDF files. VBA macros are embedded in MS document files. We show the details in the following subsections.

2.1 MS Document File
Nissim et al. proposed a framework (ALDOCX) that classifies malicious docx files using various machine learning classifiers [2]. ALDOCX creates feature vectors from the path structure of docx files. Naser et al. proposed a method to detect malicious docx files [7]. The method parses the structure of docx files, and analyzes suspicious keywords. These methods do not support the Compound File Binary (CFB) file format. Our method, however, supports MS document files which conform to both the Compound File Binary (CFB) file format and the Office Open XML (OOXML) file format.

Otsubo et al. proposed a tool (O-checker) to detect malicious document files (e.g., rtf, doc, xls, pps, jtd, pdf) [8]. O-checker detects malicious document files which contain executable files, using deviations from file format specifications. Boldewin implemented a tool (OfficeMalScanner) to detect MS document files which contain malicious shellcode or executable files [9]. The tool scans entire malicious files, and detects features such as Windows API strings, shellcode patterns, and embedded OLE data. This tool scores each document according to each of the features. If the score exceeds a threshold, this tool judges the file as malicious. Mimura et al. proposed a tool to deobfuscate embedded executable files in a malicious document file (e.g., doc, rtf, xls, pdf) and detect them [10]. These methods focused on embedded malicious executable files or shellcode, and do not detect malicious macros. Our method, on the other hand, detects malicious macros.

2.2 VBA Macro
There are few methods to detect malicious macros. Bearden et al. proposed a method of classifying MS Office files containing VBA macros as malicious or benign using the K-Nearest Neighbors machine learning algorithm, feature selection, and TFIDF, where p-code opcode n-grams compose the file features [3]. This study achieved 96.3% file classification accuracy. However, the samples were only 40 malicious and 118 benign MS Office files. This paper provides more reliable results with thousands of distinct samples. Kim et al. focused on obfuscated source code and proposed a method to detect malicious macros with a machine learning technique [4]. This method extracts feature vectors from obfuscated source code. Therefore, this method might not detect malicious VBA macros which are not obfuscated. Our method uses not only features of obfuscated macros, but also other features.

2.3 PDF File
Some researchers deal with the detection of malicious PDF files. For instance, Corona et al. proposed a method to classify malicious files according to the frequency of suspicious reference APIs [11]. Liu et al. proposed a method which analyzes obfuscated scripts to classify malicious PDF files [12]. This method uses the characteristics of obfuscation, which is common to our method. These methods classify malicious PDF files. Our method investigates MS document files and detects malicious macros.
3. Malicious VBA Macro

3.1 Behavior
This section describes the behavior of malicious macros, and reveals their features. Attackers use a slick text of the type that the victim expects, and induce the victim to open an attachment. When the victim opens the attachment and activates the macro, the macro compromises the computer. There are two types of malicious macros: Downloader and Dropper.

Downloader is a malicious macro which forces the victim's computer to download the main body. When Downloader connects to the server, it tends to use external applications. Finally, the computer downloads and installs the main body from the server.

In contrast, Dropper contains the main body in itself. When a victim opens the attachment of a phishing email, Dropper extracts the code and executes it as an executable file. The difference between Dropper and Downloader is that Dropper contains the main body in itself. Therefore, Dropper can infect victims without communicating with external resources.

Malicious macros tend to contain functions to download or extract the main body in the source code. Hence, our method attempts to detect these features.

3.2 Typical Function
To detect these features, we focus on typical functions described in the source code. Table 1 shows the typical functions which are frequently described in Downloader and Dropper.

Table 1  Typical functions in malicious macros.

  Downloader             Dropper
  CreateObject function  CustomProperties collection
  Shell function
  SendKey statement
  Declare statement

CreateObject function returns a temporary object which is part of an external application function. For example, if CreateObject function parses "InternetExplorer.Application" as an argument, the function accesses Internet Explorer. Shell function enables executing an argument as a file name. SendKey statement and Declare statement also appear in Downloader. SendKey statement sends keystrokes to the active window as if typed at the keyboard. Attackers use SendKey statement with Shell function to execute arbitrary commands. Declare statement is used to declare references to external procedures in a dynamic-link library. The Declare statement allows accessing a variety of functions. CustomProperties represents additional information, and the information can be used as metadata. Attackers frequently use CustomProperties collection to conceal malicious binary code.

3.3 Obfuscation
Most malicious macros are obfuscated to prevent analysis. Therefore, capturing obfuscated strings is an effective method for detecting malicious macros. We show some obfuscation techniques of the source code. Table 2 shows typical obfuscation techniques in malicious macros.

Table 2  Typical obfuscation methods.

  #  summary
  1  Replace statement name, etc.
  2  Encode and decode with ASCII code
  3  Use XOR
  4  Split characters
  5  Use reflection functions

Method 1 replaces class names, function names, etc. with random strings. The random strings tend to be more than 20 characters. Method 2 encodes and decodes strings with ASCII codes. VBA macros provide the AscB function and the ChrB function. AscB function encodes characters to ASCII codes. ChrB function decodes ASCII codes to characters. Method 3 encodes and decodes characters by an XOR operation. Method 4 divides a string into characters. The divided characters are assigned to variables. By concatenating those variables, the original string is restored. Method 5 uses reflection functions which execute strings as instructions. These strings contain function names, class names and method names. Attackers often conceal malicious functions with these techniques. A short illustration of these idioms follows.
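To make these idioms concrete, the following Python sketch re-creates Methods 2 and 4 (ASCII encoding and character splitting). This is an illustration we constructed for this explanation, not code from any actual sample; a real macro would express the same idioms in VBA with ChrB/AscB and string concatenation.

```python
# Illustration of two common VBA obfuscation idioms, re-created in Python.

# Method 2: the payload string is stored as ASCII codes and decoded at run time.
encoded = [104, 116, 116, 112]                 # ASCII codes for "http"
decoded = "".join(chr(c) for c in encoded)     # VBA: ChrB(104) & ChrB(116) & ...
assert decoded == "http"

# Method 4: the string is split into fragments held in separate variables,
# and the original is restored by concatenation just before use.
part1, part2, part3 = "Inter", "netExplorer.", "Application"
prog_id = part1 + part2 + part3                # VBA: part1 & part2 & part3
assert prog_id == "InternetExplorer.Application"
```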
4. NLP Technique

4.1 Bag-of-Words
BoW is a basic method to extract feature vectors from a document. BoW represents the frequency of a word in a document, and extracts a matrix from documents. In this matrix, each row corresponds to a document, and each column corresponds to a unique word in the documents. This method does not consider word order or meaning. In this method, the number of unique words corresponds to the dimension of the matrix. Thus, as the number of unique words increases, the number of matrix dimensions increases. Therefore, methods to adjust the number of dimensions are required. To adjust the number of dimensions, important words have to be selected.
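As a minimal sketch of this representation, the snippet below builds a BoW corpus with gensim, the library the proposed method uses (Section 5.3). The toy corpus and variable names are ours, chosen only for illustration.

```python
from gensim import corpora

# Toy corpus: each "document" is a word sequence extracted from one macro.
docs = [["createobject", "shell", "createobject"],
        ["sub", "print", "end"]]

dictionary = corpora.Dictionary(docs)          # maps each unique word to a column id
bow = [dictionary.doc2bow(d) for d in docs]    # sparse (word_id, frequency) pairs per doc

# e.g., bow[0] contains (id("createobject"), 2) because the word occurs twice;
# the matrix dimension equals len(dictionary), the number of unique words.
print(len(dictionary), bow[0])
```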
4.2 Term Frequency-Inverse Document Frequency
Term Frequency-Inverse Document Frequency (TFIDF) is one of the most popular methods for selecting important words. We introduce how a TFIDF value is calculated.

$$\mathrm{TFIDF}_{i,j} = \mathrm{frequency}_{i,j} \times \log_2 \frac{D}{\mathrm{documentfrequency}_i}$$

The $\mathrm{frequency}_{i,j}$ (TF) is the frequency of word $i$ in document $j$. The $\mathrm{documentfrequency}_i$ is the number of documents in which word $i$ appears. The IDF is the logarithm of the value in which $D$ (the total number of documents) is divided by $\mathrm{documentfrequency}_i$. The TFIDF value is the product of TF and IDF. Finally, TFIDF values are normalized. In this model, if a word appears rarely in the entire corpus and appears frequently in a document, its TFIDF value increases.
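A worked example under this definition follows (the final normalization step is omitted here for brevity). The toy corpus is ours, for illustration only.

```python
import math

# Toy corpus of three "documents" (word lists extracted from macros).
docs = [["shell", "shell", "print"],
        ["print", "end"],
        ["sub", "end"]]
D = len(docs)

def tfidf(word, doc):
    tf = doc.count(word)                        # frequency of the word in this document
    df = sum(1 for d in docs if word in d)      # number of documents containing the word
    return tf * math.log(D / df, 2)             # base-2 logarithm, as in the formula

# "shell" appears twice in docs[0] and in no other document:
# TFIDF = 2 * log2(3/1) = 3.17, a high score for a rare-but-frequent word.
print(round(tfidf("shell", docs[0]), 2))
# "end" appears once in docs[1] but in two of three documents:
# TFIDF = 1 * log2(3/2) = 0.58, a low score for a widespread word.
print(round(tfidf("end", docs[1]), 2))
```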
4.3 Doc2vec
Doc2vec [13] is an extension of Word2vec [14]. First, we introduce Word2vec. Word2vec is a model that is used to represent word embeddings. Word2vec is a two-layer neural network that is trained to reconstruct the linguistic context of words; it has a hidden layer and an output layer. The input of Word2vec is a large corpus of documents, and Word2vec represents the input as feature vectors. The number of dimensions of the feature vector is typically several hundred. Each unique word in the corpus is assigned a corresponding element of the feature vector. Word vectors are positioned in the vector space such that words sharing common contexts in the corpus are positioned in close proximity to one another in the space. This is based on the probability of word co-occurrence around a word. Word2vec has two algorithms, Continuous Bag-of-Words (CBoW) and Skip-gram. CBoW is an algorithm which predicts a centric word from surrounding words. Skip-gram is an algorithm which predicts surrounding words from a centric word. Word2vec enables obtaining the similarity of words, and also predicting equivalent words. Doc2vec has two algorithms, Distributed Memory (DM) and Distributed Bag-of-Words (DBoW). DM and DBoW are extensions of CBoW and Skip-gram respectively. Doc2vec enables obtaining the similarity of documents, and also extracting feature vectors from documents.
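As a brief sketch of these capabilities with gensim, assuming a toy corpus of tokenized macros (we use current gensim parameter names; older releases such as the gensim 2.0.0 used in Section 5.3 call vector_size and epochs size and iter):

```python
from gensim.models import Word2Vec

# Toy corpus: sentences are word sequences extracted from macros.
sentences = [["createobject", "internetexplorer", "navigate"],
             ["createobject", "wscript", "shell", "run"],
             ["sub", "print", "end"]] * 50    # repeated so the toy model has data

# Train a small Skip-gram model (sg=1); real corpora use hundreds of dimensions.
model = Word2Vec(sentences, vector_size=50, sg=1, min_count=1, epochs=10)

# Words sharing contexts ("createobject" co-occurs with both payload idioms)
# end up close together in the vector space.
print(model.wv.most_similar("createobject", topn=3))
```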
5. Proposed Method

5.1 Outline
We propose a method to detect unseen malicious macros with NLP techniques. Our method requires a language model and a classifier to detect malicious macros. Figure 1 shows an outline of the proposed method.

Fig. 1  An outline of the proposed method.

In the training phase, our method requires both malicious and benign samples with their labels. Step 1 extracts words from labeled macros in MS document files. Step 2 selects important words and constructs a language model with the corpus. Then the extracted words are converted into feature vectors with the language model. Step 3 trains classifiers with the extracted feature vectors and labels.

In the test phase, our method investigates unlabeled samples. Step 1 extracts words from unknown macros in MS document files. Step 2 converts the extracted words into feature vectors with the language model. Step 3 classifies the extracted feature vectors with the trained classifiers, and the predicted labels are obtained.
5.2 Extract Word
Our method extracts macros from MS document files with Olevba [15]. Olevba is open source software that can extract macros from MS document files. Then, our method divides the source code into words. Our method uses some special characters as delimiters. Table 3 shows the special characters.

Table 3  Special characters as the delimiter.

  character  name            character  name
  "          double quote    +          plus
  '          single quote    /          slash
  {}         curly bracket   &          ampersand
  ()         round bracket   %          percentage
  ,          comma           ¥          yen sign
  .          period          $          dollar sign
  *          asterisk        #          sharp
  -          hyphen          @          at mark

Thereafter, our method replaces some patterns. Table 4 shows the patterns.

Table 4  The patterns to replace.

  method  pattern                             replaced word
  1       Hexadecimal 1 (e.g., 0xXX)          0xhex
  2       Hexadecimal 2 (e.g., &HXX)          andhex
  3       Asc, AscB, AscW                     asc
  4       A string of 20 or more characters   longchr
  5       A number of 20 digits or more       longnum
  6       Element of array                    elementofarray

These patterns frequently appear in malicious macros, because most malicious macros are obfuscated with these methods [4], [6]. If our method did not replace these patterns, each distinct string would be handled as a separate feature. These words, however, share a common context or meaning. Our method therefore replaces these patterns with single words to improve accuracy. A sketch of this extraction step follows.
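The following is a minimal sketch of this step, assuming the macro source has already been extracted by Olevba. The regular expressions are our approximations of Table 3 and a subset of Table 4; the paper does not publish its exact implementation.

```python
import re

# Approximations of Table 4: normalize hex literals on the raw source first,
# because "&" would otherwise be lost when the source is tokenized.
HEX1 = re.compile(r'\b0x[0-9a-fA-F]+\b')        # Hexadecimal 1 -> 0xhex
HEX2 = re.compile(r'&[hH][0-9a-fA-F]+\b')       # Hexadecimal 2 -> andhex

# Delimiter characters of Table 3 (plus whitespace).
DELIMS = re.compile(r'[\s"\'{}(),.*+/&%¥$#@-]+')

def extract_words(source):
    source = HEX1.sub('0xhex', source)
    source = HEX2.sub('andhex', source)
    words = []
    for token in DELIMS.split(source.lower()):
        if not token:
            continue
        if re.fullmatch(r'asc[bw]?', token):      # Asc, AscB, AscW
            token = 'asc'
        elif re.fullmatch(r'\d{20,}', token):     # a number of 20 digits or more
            token = 'longnum'
        elif len(token) >= 20:                    # a string of 20 or more characters
            token = 'longchr'
        words.append(token)
    return words

print(extract_words('Shell("cmd") & Chr(&H41) + "aaaaaaaaaaaaaaaaaaaaaaaa"'))
# -> ['shell', 'cmd', 'chr', 'andhex', 'longchr']
```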
5.3 Language Model
In the training phase, our method selects important words based on the TF or TFIDF values. Then, our method constructs a language model with the selected words. To construct a language model, our method uses BoW and Doc2vec. Thereafter, our method converts the words into feature vectors with the language model. In the test phase, our method uses the constructed language model, and converts the words extracted from unknown macros. Our method uses gensim-2.0.0 [16] to implement BoW and Doc2vec. Gensim has many functions related to natural language processing techniques. The Doc2vec model is trained for 30 epochs with the DBoW algorithm. The number of dimensions is 100. These parameters were determined through a trial and error process.
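A minimal sketch of this step with gensim follows, using the parameters stated above (DBoW, 100 dimensions, 30 epochs). We use the current gensim parameter names (vector_size, epochs); gensim 2.0.0 calls them size and iter. The word lists are placeholders.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Placeholder corpora: word lists produced by the extraction step (Section 5.2).
train_words = [["createobject", "shell", "longchr"], ["sub", "print", "end"]]
test_words = [["createobject", "andhex", "longchr"]]

# Each training macro becomes a TaggedDocument with a unique tag.
corpus = [TaggedDocument(words, [i]) for i, words in enumerate(train_words)]

# dm=0 selects the DBoW algorithm; 100 dimensions, 30 epochs (Section 5.3).
model = Doc2Vec(corpus, dm=0, vector_size=100, epochs=30, min_count=1)

# Training vectors come from the model; unseen macros are inferred
# against the fixed language model, as in the test phase.
X_train = [model.dv[i] for i in range(len(train_words))]
X_test = [model.infer_vector(words) for words in test_words]
print(len(X_train[0]), len(X_test[0]))  # -> 100 100
```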
5.4 Classifier
Our method uses the extracted feature vectors and labels to train the classifiers: Support Vector Machine (SVM), Random Forests (RF), and Multi Layer Perceptron (MLP). These classifiers are fundamental and often used in this field. Our method uses scikit-learn-0.18.1 [17] to implement SVM, RF, and MLP. Scikit-learn is a machine learning library and has many classification algorithms. These classifiers use default values for all parameters.
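A sketch of this step, assuming X_train and X_test are the feature vectors from Section 5.3 and y_train holds the labels (1 = malicious, 0 = benign); as in the paper, all classifiers keep scikit-learn's default parameters. The random placeholder data is ours, for illustration only.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

# Placeholder data shaped like the output of Section 5.3 (100-dimensional vectors).
rng = np.random.RandomState(0)
X_train, y_train = rng.rand(20, 100), rng.randint(0, 2, 20)
X_test = rng.rand(5, 100)

# The three basic classifiers of Section 5.4, with default parameters.
for clf in (SVC(), RandomForestClassifier(), MLPClassifier()):
    clf.fit(X_train, y_train)                        # training phase
    print(type(clf).__name__, clf.predict(X_test))   # test phase: predicted labels
```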
6. Experiment

6.1 Environment
This section conducts experiments to evaluate our method. Table 5 shows the environment. We implemented our method with Python 2.7 in this environment.

Table 5  Environment.

  CPU     Intel Core i7 (3.30 GHz)
  memory  32 GB
  OS      Windows 8.1 Pro

6.2 Dataset
To evaluate our method, we use actual malicious and benign macros. Table 6 shows the numbers of samples.

Table 6  The numbers of samples.

             2015's samples (Training)  2016's samples (Test)
  benign     622                        1,200
  malicious  515                        641

This dataset was collected and provided by VirusTotal [18]. We selected all VBA macros whose file extensions were doc, docx, xls, xlsx, ppt, and pptx. These samples were uploaded to VirusTotal between 2015 and 2016 for the first time. The malicious samples are judged malicious by more than 50% of anti-virus vendors. The benign samples are judged benign by all anti-virus vendors. We investigated the rates in September 2018. This means that anti-virus vendors had plenty of time for analysis. Therefore, we assume the rates are partially reliable. However, some malicious VBA macros for APT attacks might not be shared with all anti-virus vendors. Hence, we chose these thresholds for sample selection. There is no overlap between these specimens. We use the 2015 samples as training data and the 2016 samples as test data in the following experiments. In the following experiments, we assume that the present time is the end of 2015. At the end of 2015, the 2016 samples were unseen samples. At that time, many anti-virus programs with the latest definitions probably could not detect these samples, because anti-virus programs need samples to update the definitions, and these samples were uploaded during 2016 for the first time. Subsequently, these samples were analyzed and finally labeled.

Table 7 and Table 8 illustrate the malware families and their rates in the datasets. These malware families are defined by Microsoft Defender [19]. The representative malware family is O97M/Donoff in both datasets. In comparison with the 2015 samples, the 2016 samples contain 14.6% new malware families.

Table 7  The rates of malware families in 2015's samples.

      family          rate
  1   O97M/Donoff     78.0%
  2   O97M/Adnel      5.2%
  3   O97M/Bartallex  4.3%
  4   W97M/Adnel      3.3%
  5   X97M/Donoff     2.8%
  6   O97M/Madeba     0.9%
  7   O97M/Farheyt    0.9%
  8   None            0.9%
  9   O97M/Daoyap     0.9%
  10  W97M/Bartallex  0.6%

Table 8  The rates of malware families in 2016's samples.

      family                rate
  1   O97M/Donoff           65.4%
  2   New malware families  14.6%
  3   None                  7.1%
  4   O97M/Madeba           5.3%
  5   W97M/Thus             3.4%
  6   W97M/Marker           1.7%
  7   O97M/Farheyt          1.3%
  8   O97M/Macrobe          0.4%
  9   W97M/Adnel            0.2%
  10  O97M/Bartallex        0.2%
6.3 Evaluation Measure
To evaluate accuracy, this paper uses Precision, Recall, and F-measure as metrics. These metrics are defined as follows.

$$\mathit{Precision} = \frac{TP}{TP + FP}$$

$$\mathit{Recall} = \frac{TP}{TP + FN}$$

$$F\text{-}measure = \frac{2\,\mathit{Recall} \times \mathit{Precision}}{\mathit{Recall} + \mathit{Precision}}$$

Table 9 shows the confusion matrix.

Table 9  Confusion matrix.

                       actual value
                       true  false
  predicted  positive  TP    FP
  result     negative  FN    TN

F-measure is a useful metric that considers both precision and recall. Since this paper does not investigate the details of detection rates deeply, it focuses on F-measure.
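These definitions translate directly into code; a small sketch with assumed counts follows (the counts are hypothetical, not the paper's results).

```python
def f_measure(tp, fp, fn):
    # Precision: share of predicted positives that are truly malicious.
    precision = tp / (tp + fp)
    # Recall: share of malicious samples that were detected.
    recall = tp / (tp + fn)
    # F-measure: harmonic mean of precision and recall.
    return 2 * recall * precision / (recall + precision)

# Hypothetical counts for illustration only:
# 590 true positives, 60 false positives, 51 false negatives.
print(round(f_measure(590, 60, 51), 2))  # -> 0.91
```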
6.4 Experiment
To evaluate our method, we conduct the four following experiments. Each experiment corresponds to a research question described in Section 1.

Experiment 1
After extracting words from both benign and malicious macros, our method selects the important words. To select these words, our method has 3 options as follows.
• Select words from malicious macros
• Select words from benign macros
• Select words from both macros
Note that the selected words are extracted from both benign and malicious macros in all options. In this experiment, we attempt these 3 options with BoW and SVM. To adjust the number of dimensions, important words are selected with TF. The purpose of this experiment is to investigate the most effective method to extract words and construct a corpus. In the following experiments, our method uses the best method to construct a corpus.

Experiment 2
Our method replaces some patterns with single words. To evaluate the effectiveness, our method has 2 options as follows.
• Replace some patterns with single words (Replaced)
• Do not replace (Unreplaced)
In this experiment, we attempt these methods with BoW and SVM. To adjust the number of dimensions, important words are selected with TF. The purpose of this experiment is to evaluate the effectiveness of the replacement process.

Experiment 3
This experiment compares the combinations of the language models and classifiers. In this experiment, we attempt BoW and Doc2vec as the language models, and SVM, RF, and MLP as the classifiers. To adjust the number of dimensions, important words are selected with TF. The purpose of this experiment is to investigate the best combination of the language models and classifiers. In the following experiment, our method uses the best combination.

Experiment 4
The final experiment compares the methods to select important words. TF and TFIDF are used to adjust the number of dimensions. The purpose of this experiment is to investigate the most effective method to select important words.
6.5 Result
Experiment 1
Figure 2 shows the classification accuracy of the 3 methods with BoW and SVM. The horizontal axis corresponds to the dimensions, and the vertical axis corresponds to the F-measure.

Fig. 2  The classification accuracy of the methods with BoW and SVM.

As a result, extracting words from malicious macros is effective. Therefore, our method uses this method to construct a corpus in the following experiments.

Experiment 2
Figure 3 shows the classification accuracy of the replaced method and unreplaced method with BoW and SVM. The horizontal axis corresponds to the dimensions, and the vertical axis corresponds to the F-measure.

Fig. 3  The classification accuracy of the replaced method and unreplaced method with BoW and SVM.

As a result, replacing these patterns with single words is effective. Therefore, our method replaces these patterns with single words in the following experiments.

Experiment 3
Figure 4, Fig. 5, and Fig. 6 show the classification accuracy of the combinations of the language models and classifiers. The horizontal axis corresponds to the dimensions, and the vertical axis corresponds to the F-measure.

Fig. 4  The classification accuracy of the SVM with BoW and Doc2vec.
Fig. 5  The classification accuracy of the RF with BoW and Doc2vec.
Fig. 6  The classification accuracy of the MLP with BoW and Doc2vec.

Overall, the F-measure of Doc2vec is higher than that of BoW in Fig. 4 and Fig. 6. In contrast, the F-measure of BoW is higher than that of Doc2vec in Fig. 5. In Fig. 6, the best F-measure achieves 0.93. Moreover, the F-measure with MLP and Doc2vec is stable. The combination of SVM and Doc2vec is also stable and quite good. Therefore, we conclude the best combination is MLP and Doc2vec.

Experiment 4
Figure 7 shows the classification accuracy of the methods with TF and TFIDF. The horizontal axis corresponds to the dimensions, and the vertical axis corresponds to the F-measure.

Fig. 7  The classification accuracy of the methods with TF and TFIDF.

As a result, TF is more effective than TFIDF. The best F-measure achieves 0.93.
7. Discussion

7.1 Detecting Unseen Malware Families
We investigated the new malware families in the 2016 samples. Table 10 shows the families.

Table 10  The new malware families in the 2016's samples.

      family               family
  1   JS/Swabfex       15  W97M/Nsi
  2   O97M/Zinunlate   16  X97M/ShellHide
  3   O97M/Vibro       17  W97M/Qncwan
  4   X97M/Mailcab     18  W97M/Groov
  5   W97M/Avosim      19  W97M/Agent
  6   X97M/Laroux      20  Win32/Bitrep
  7   O97M/Powmet      21  W97M/Walker
  8   XM/Laroux        22  W97M/Broxoff
  9   O97M/Pyordonofz  23  Win32/Occamy
  10  W97M/Xaler       24  Win32/Skeeyah
  11  O97M/Emulasev    25  O97M/Prikormka
  12  O97M/Bancarobe   26  W97M/Ursnif
  13  Gen              27  Win32/Tiggre
  14  O97M/DarkSnow    28  O97M/Pollwer

Our method detected 89% of the new malware families. Therefore, our method can detect unseen malicious macros which contain new malware families. The main functions of malicious macros are downloading and executing the main body. New malware samples have to contain these functions to some extent, because they require the main body, which has many sophisticated functions. Therefore, our method can effectively detect new malware samples.

Our method could not detect some malicious macros. These malicious macros are not obfuscated. Hence, one possible reason is that these macros do not contain the typical patterns described in Section 3. These macros contain suspicious SQL commands and URLs. These suspicious words do not frequently occur in benign macros. If we replaced these suspicious words with single words, the detection rate might be improved.

7.2 Frequent Words in Malicious Macros
In the first experiment, extracting words from malicious macros was the most effective. To reveal the reason, Table 11 shows the frequent words in malicious macros. The ratio is calculated by dividing "the number of samples which contain the word" by "the number of samples".

Table 11  Frequent words in malicious macros.

  word            ratio in malicious  ratio in benign
  elementofarray  99.0%               43.0%
  andchr          93.9%               28.0%
  next            90.9%               27.9%
  function        85.1%               18.3%
  string          83.3%               25.7%
  len             79.0%               14.7%
  public          77.5%               17.7%
  longchr         73.7%               19.7%
  createobject    73.0%               6.6%
  error           73.0%               20.7%
  byte            56.1%               1.5%
  callbyname      51.3%               0.1%

The left column of the table lists frequent words in malicious macros; the middle and right columns give each word's frequency in malicious and benign macros respectively. Malicious macros include many of these frequent words. In contrast, benign macros rarely include these words, except for a few. Therefore, classifiers can easily discriminate between malicious and benign macros. This also explains why TFIDF was less effective than TF in classifying macros. TFIDF values increase if a word appears rarely in the entire corpus. These frequent words, however, appear frequently in malicious macros. Hence, their TFIDF values decrease. That is why TFIDF was less effective than TF. Furthermore, these frequent words contain some replaced words such as "elementofarray" or "longchr". This might be one possible reason that replacing some patterns with single words was effective.

7.3 The Best Combination of the NLP Techniques and Classifiers
In the third experiment, Doc2vec was more effective than BoW for classifying macros. This result might depend on word order or meaning. The F-measure with MLP and Doc2vec was stable, and the best F-measure achieved 0.93. The combination of SVM and Doc2vec was also stable and quite good. MLP and SVM have something in common: these classifiers perform quick pattern classification by linear separation. Therefore, we conclude that Doc2vec and linear classifiers are effective for classifying macros.

In the first and second experiments, we used BoW and SVM, which are the most basic and fundamental algorithms. We did not evaluate with RF, MLP, and Doc2vec. Hence, there is some possibility that these combinations might achieve better results. SVM, however, obtained the most stable results. Therefore, it appears that the rough results make little difference.

7.4 Comparison
The purpose of our method is detecting unseen malicious VBA macros. In practical use, many methods, including our method, can only use previous samples for training, and the test samples should not be the previous samples. If test samples contain previous samples, it is not possible to evaluate the performance appropriately. Therefore, appropriate experimental conditions with enough samples are required.

Bearden et al. proposed a method to detect malicious VBA macros with machine learning and NLP techniques [3]. Their method uses traditional machine learning and NLP techniques, and achieved 96.3% classification accuracy. However, the samples were only 40 malicious and 118 benign MS Office files. To evaluate their method in practical use, more samples are required.
Kim et al. proposed a method to detect malicious VBA macros with feature vectors from obfuscated source code [4]. Their method might not be able to detect non-obfuscated malicious VBA macros. Our method uses not only features of obfuscated macros, but also other features. Moreover, they conducted cross-validation with thousands of samples to evaluate their method. The details of the samples are not described in their paper. However, in reality, their method can only use previous samples for training. Hence, cross-validation is not appropriate in this case. Their method might not detect unseen malicious macros which contain new malware families. We evaluated our method with thousands of samples, and used only previous samples for training. We described the details of our samples, and showed that our method can detect unseen malicious macros which contain new malware families.

8. Conclusion

In this paper, we propose a method to detect unseen malicious macros themselves. To investigate the source code, we focus on NLP techniques. Our method divides the source code into words, and extracts feature vectors from the corpus with BoW and Doc2vec. Our method selects important words with TF and TFIDF to improve accuracy. Then, our method uses basic classifiers to detect unseen macros. Experimental results show that our method can detect 89% of new malware families, and the best F-measure achieves 0.93. Doc2vec represents feature vectors effectively, and the best combination of NLP techniques and classifiers is Doc2vec and MLP. Contrary to our expectations, TF is more effective than TFIDF in classifying macros.

In this paper, we used both malicious and benign samples obtained from VirusTotal. We assumed these samples represented all VBA macros on the Internet. We selected all VBA macros whose file extensions were doc, docx, xls, xlsx, ppt, and pptx. Hence, we believe these malicious samples mostly represent the population of malware samples. More benign samples, however, might have to be collected to represent the population. For future work, we should evaluate our method with other samples. To derive more reliable results, samples should be obtained from other sources. The latest malware samples should also be investigated. However, as we mentioned previously, these latest samples should be analyzed on a long-term basis; it seems to take more time to label them precisely. Developing a practical detection system is another task for future work.
References
[1] Sophos: Wolf in sheep's clothing: A SophosLabs investigation into delivering malware via VBA, available from https://nakedsecurity.sophos.com/2017/05/31/wolf-in-sheeps-clothing-a-sophoslabs-investigation-into-delivering-malware-via-vba/.
[2] Nissim, N., Cohen, A. and Elovici, Y.: ALDOCX: Detection of Unknown Malicious Microsoft Office Documents Using Designated Active Learning Methods Based on New Structural Feature Extraction Methodology, IEEE Trans. Information Forensics and Security, Vol.12, No.3, pp.631–646 (2017).
[3] Bearden, R. and Lo, D.C.-T.: Automated Microsoft Office Macro Malware Detection Using Machine Learning, 2017 IEEE International Conference on Big Data, BigData 2017, pp.4448–4452, IEEE (2017).
[4] Kim, S., Hong, S., Oh, J. and Lee, H.: Obfuscated VBA Macro Detection Using Machine Learning, DSN 2018, pp.490–501, IEEE Computer Society (2018).
[5] Miura, H., Mimura, M. and Tanaka, H.: Discovering New Malware Families Using a Linguistic-Based Macros Detection Method, 2018 Sixth International Symposium on Computing and Networking Workshops (CANDARW), pp.431–437, DOI: 10.1109/CANDARW.2018.00085 (2018).
[6] Miura, H., Mimura, M. and Tanaka, H.: Macros Finder: Do You Remember LOVELETTER?, Proc. Information Security Practice and Experience, 14th International Conference, ISPEC 2018, pp.3–18, DOI: 10.1007/978-3-319-99807-7_1 (2018).
[7] Naser, A., Hjouj Btoush, M. and Hadi, A.: Analyzing and Detecting Malicious Content: DOCX Files, International Journal of Computer Science and Information Security (IJCSIS), Vol.14, pp.404–412 (2016).
[8] Otsubo, Y., Mimura, M. and Tanaka, H.: O-checker: Detection of Malicious Documents through Deviation from File Format Specifications, Black Hat USA (2016).
[9] Boldewin, F.: Analyzing MSOffice Malware with OfficeMalScanner (2009).
[10] Mimura, M., Otsubo, Y. and Tanaka, H.: Evaluation of a Brute Forcing Tool that Extracts the RAT from a Malicious Document File, AsiaJCIS 2016, pp.147–154, IEEE Computer Society (2016).
[11] Corona, I., Maiorca, D., Ariu, D. and Giacinto, G.: Lux0R: Detection of Malicious PDF-embedded JavaScript Code through Discriminant Analysis of API References, Proc. 2014 Workshop on Artificial Intelligent and Security Workshop, AISec 2014, pp.47–57, ACM (2014).
[12] Liu, D., Wang, H. and Stavrou, A.: Detecting Malicious Javascript in PDF through Document Instrumentation, DSN 2014, pp.100–111, IEEE Computer Society (2014).
[13] Le, Q.V. and Mikolov, T.: Distributed Representations of Sentences and Documents, Proc. 31st International Conference on Machine Learning, ICML 2014, pp.1188–1196 (2014), available from http://jmlr.org/proceedings/papers/v32/le14.html.
[14] Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S. and Dean, J.: Distributed Representations of Words and Phrases and their Compositionality, Advances in Neural Information Processing Systems 26: Proc. 27th Annual Conference on Neural Information Processing Systems 2013, pp.3111–3119 (2013).
[15] olevba, available from https://github.com/decalage2/oletools/wiki/olevba.
[16] gensim: topic modelling for humans, available from https://radimrehurek.com/gensim/.
[17] scikit-learn: Machine Learning in Python, available from https://scikit-learn.org/.
[18] VirusTotal, available from https://www.virustotal.com/.
[19] Windows Defender Antivirus, available from https://www.microsoft.com/en-us/windows/windows-defender/.
Mamoru Mimura received his B.E. and M.E. in Engineering from National Defense Academy of Japan, in 2001 and 2008 respectively. He received his Ph.D. in Informatics from the Institute of Information Security in 2011 and M.B.A. from Hosei University in 2014. During 2001–2017, he was a member of the Japanese Maritime Self Defense Forces. During 2011–2013, he was with the National Information Security Center. Since 2014, he has been a researcher in the Institute of Information Security. Since 2015, he has been with the National center of Incident readiness and Strategy for Cybersecurity. Currently, he is an Associate Professor in the Department of C.S., National Defense Academy of Japan.

Hiroya Miura received his B.E. and M.E. in Engineering from National Defense Academy of Japan, in 2013 and 2019 respectively. Currently, he is a member of the Japanese Ground Self Defense Forces.
