Detecting Unseen Malicious VBA Macros With NLP Techniques

Journal of Information Processing Vol.27 pp.555–563 (Sep. 2019)
[DOI: 10.2197/ipsjjip.27.555]
Regular Paper
Abstract: In recent years, the number of targeted email attacks that use Microsoft (MS) document files has been increasing. In particular, malicious VBA (Visual Basic for Applications) macros are often contained in these MS document files. Some researchers have proposed methods to detect malicious MS document files. However, only a few methods analyze the malicious macros themselves. This paper proposes a method to detect unseen malicious macros using the words extracted from their source code. Malicious macros tend to contain typical functions to download or execute the main body, and obfuscated strings such as encoded or divided characters. Our method represents feature vectors from the corpus with several NLP (Natural Language Processing) techniques. Our method then trains basic classifiers with the extracted feature vectors and labels, and the trained classifiers predict the labels of unseen macros. Experimental results show that our method can detect 89% of new malware families. The best F-measure achieves 0.93.

Keywords: VBA macro, machine learning, natural language processing technique, bag-of-words, Doc2vec, TFIDF
1. Introduction

In recent years, email has become one of the most popular communication tools. This situation has made targeted email attacks a serious threat to society. A targeted email attack is a specific attack in which the attacker attempts to persuade a victim to perform a specific action. Depending on that action, there are two types of targeted email attacks. One is to make the victim open a malicious link and download a malicious program, and the other is to make the victim open a malicious attachment. Attackers attempt to earn credibility with their victims through an eloquent mail text. Moreover, the attackers convince victims to unknowingly download a malicious file or click through to a malicious site. According to a report published by Sophos [1], malicious attachment files are used in most targeted email attacks. The report shows that 85% of the attached files are Microsoft Office (MS) document files. Furthermore, most of these MS document files contain malicious VBA (Visual Basic for Applications) macros. VBA is an implementation of Microsoft's programming language Visual Basic 6, and is built into most Microsoft Office applications. Malicious macros have a long history. For example, the LOVELETTER worm, one of the most famous malicious macros, infected more than 45 million computers, and some organizations suffered serious damage in 2000. Subsequently, malicious macros gradually faded out. They grew in popularity again with the rise of targeted email attacks. Macros are a powerful tool to automate common tasks in MS document files. However, malicious macros abuse this functionality to infect computers. In targeted attacks, attackers often use unseen malicious macros which are not detected by anti-virus programs with the latest definitions. In general, anti-virus programs require virus pattern files, and the pattern files have to be updated. Most attackers, however, obfuscate programs to evade detection. Therefore, it is difficult to detect unseen malicious macros that contain new malware families.

To detect unseen malicious macros, some previous methods can be applied. For instance, Nissim et al. [2] analyzed the structure of docx files and proposed a method to detect malicious docx files. These previous methods detect malicious MS document files. These methods, however, do not discriminate between malicious macros and benign macros. Hence, if an MS document file contains benign macros, these methods might detect the benign macros as malicious ones. If the malicious MS document file is camouflaged with the structure of a benign MS document file, the attacker can probably evade the detection method. Detecting malicious macros themselves can overcome these weaknesses. However, there are only a few methods to analyze malicious macros themselves [3], [4]. Hence, there is room for improvement on these methods.

This paper proposes a method to detect unseen malicious macros themselves. Malicious VBA macros tend to contain typical functions to download or execute the main body, and obfuscated strings such as encoded or divided characters. To investigate the source code, we focus on NLP (Natural Language Processing) techniques. NLP techniques are usually used to analyze natural languages, as the name indicates. In this paper, we presume VBA macros are written in a natural language, and attempt to learn the difference between benign and malicious VBA macros with Doc2vec. Doc2vec is an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of text. Then we input the extracted feature vectors with the labels into supervised learning models to classify benign and malicious VBA macros. The key idea of this research is reading
VBA macros as a natural language. To the best of our knowledge, Doc2vec has never been applied to detecting malicious VBA macros. Doc2vec enables extracting feature vectors from VBA macros automatically. That is why we focus on Doc2vec.

Our method uses some NLP techniques to investigate the macro's source code. Our method divides the source code into words, and represents feature vectors from the corpus with several NLP techniques. Term Frequency (TF) and Term Frequency-Inverse Document Frequency (TFIDF) are used to select important words for classification. TF is a simple method which weights words according to their frequency in a corpus. TFIDF is a more sophisticated method which weights representative words in a corpus. Bag-of-Words (BoW) and Doc2vec represent feature vectors from the corpora. BoW is a basic method that represents vectors corresponding to the frequency of the words. Doc2vec is a more complicated model that represents vectors from the context of the documents. Our method then uses the extracted feature vectors and labels to train basic classifiers: Support Vector Machine (SVM), Random Forests (RF), and Multi Layer Perceptron (MLP). Finally, the trained classifiers predict the labels of unseen macros. Experimental results show that our method can detect 89% of new malware families. The best F-measure achieves 0.93.

This paper addresses 4 research questions as follows.
( 1 ) Does our method detect unseen malware families?
( 2 ) Does Doc2vec represent feature vectors effectively?
( 3 ) What is the best combination of these NLP techniques and classifiers?
( 4 ) Does TFIDF select important words to classify macros?
In order to address these questions, we conduct some experiments. Based on the results, this paper makes the following contributions:
( 1 ) We propose a method to detect unseen malicious macros which contain new malware families [5].
( 2 ) Doc2vec is effective in classifying malicious macros [6].
( 3 ) Linear classifiers are effective for Doc2vec [6].
( 4 ) Reducing words using Term Frequency is effective for classifying macros [6].
We will introduce the structure of this paper. Section 2 introduces related work and reveals the differences between this paper and other relevant studies. Section 3 describes malicious VBA macros, and Section 4 presents some NLP techniques. Section 5 proposes the method, and Section 6 conducts experiments. Section 7 discusses the results, and finally, Section 8 describes the conclusion.

2. Related Work

2.1 MS Document File
Nissim et al. proposed a framework (ALDOCX) that classifies malicious docx files using various machine learning classifiers [2]. ALDOCX creates feature vectors from the path structure of docx files. Naser et al. proposed a method to detect malicious docx files [7]. The method parses the structure of docx files, and analyzes suspicious keywords. These methods do not support the Compound File Binary (CFB) file format. Our method, however, supports MS document files which conform to both the Compound File Binary (CFB) file format and the Office Open XML (OOXML) file format.

Otsubo et al. proposed a tool (O-checker) to detect malicious document files (e.g., rtf, doc, xls, pps, jtd, pdf) [8]. O-checker detects malicious document files which contain executable files, using deviations from file format specifications. Boldewin implemented a tool (OfficeMalScanner) to detect MS document files which contain malicious shellcode or executable files [9]. The tool scans entire malicious files, and detects features such as strings of Windows API names, shellcode patterns, and embedded OLE data. This tool scores each document corresponding to each of the features. If the scores exceed a threshold, this tool judges the file as malicious. Mimura et al. proposed a tool to deobfuscate embedded executable files in a malicious document file (e.g., doc, rtf, xls, pdf) and detect them [10]. These methods focused on embedded malicious executable files or shellcode, and do not detect malicious macros. Our method, on the other hand, detects malicious macros.

2.2 VBA Macro
There are a few methods to detect malicious macros. Bearden et al. proposed a method of classifying MS Office files containing VBA macros as malicious or benign using the K-Nearest Neighbors machine learning algorithm, feature selection, and TFIDF, where p-code opcode n-grams compose the file features [3]. This study achieved 96.3% file classification accuracy. However, the samples were only 40 malicious and 118 benign MS Office files. This paper provides more reliable results with thousands of distinct samples. Kim et al. focused on obfuscated source code and proposed a method to detect malicious macros with a machine learning technique [4]. This method extracts feature vectors from obfuscated source code. Therefore, this method might not detect malicious VBA macros which are not obfuscated. Our method uses not only features in obfuscated macros, but also other features.
3. Malicious VBA Macro

3.1 Behavior
This section describes the behavior of malicious macros and reveals their features. Attackers use a slick text of the type that the victim expects, and induce the victim to open an attachment. When the victim opens the attachment and activates the macro, the macro compromises the computer. There are two types of malicious macros: Downloader and Dropper.

Downloader is a malicious macro which forces the victim's computer to download the main body. When Downloader connects to the server, it tends to use external applications. Finally, the computer downloads and installs the main body from the server.

In contrast, Dropper contains the main body in itself. When a victim opens the attachment of a phishing email, Dropper extracts the code and executes it as an executable file. The difference between Dropper and Downloader is that Dropper contains the main body in itself. Therefore, Dropper can infect victims without communicating with external resources.

Malicious macros tend to contain functions to download or extract the main body in the source code. Hence, our method attempts to detect these features.

3.2 Typical Function
To detect these features, we focus on typical functions described in the source code. Table 1 shows the typical functions which are frequently described in Downloader and Dropper. The CreateObject function returns a temporary object which is part of an external application. For example, if the CreateObject function takes "InternetExplorer.Application" as an argument, the function accesses Internet Explorer. The Shell function makes it possible to execute an argument as a file name. The SendKeys statement and the Declare statement also appear in Downloader. The SendKeys statement sends keystrokes to the active window as if they were typed at the keyboard. Attackers use the SendKeys statement with the Shell function to execute arbitrary commands. The Declare statement is used to declare references to external procedures in a dynamic-link library. The Declare statement allows accessing a variety of functions. CustomProperties represents additional information, and the information can be used as metadata. Attackers frequently use the CustomProperties collection to conceal malicious binary code.
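The idea of focusing on typical functions can be made concrete with a short Python sketch. This is only an illustration under assumptions, not the paper's implementation: it assumes macro source code is extracted with olevba from the oletools package (Ref. [15]) and simply counts the typical function names above; the input file name is hypothetical.

from collections import Counter
from oletools.olevba import VBA_Parser

# Typical functions frequently described in Downloader and Dropper (Section 3.2).
TYPICAL_FUNCTIONS = ["CreateObject", "Shell", "SendKeys", "Declare", "CustomProperties"]

def count_typical_functions(path):
    """Extract VBA source from an MS document file and count the typical functions."""
    counts = Counter()
    parser = VBA_Parser(path)
    if parser.detect_vba_macros():
        for _, _, _, vba_code in parser.extract_macros():
            for name in TYPICAL_FUNCTIONS:
                counts[name] += vba_code.count(name)
    parser.close()
    return counts

# Example (hypothetical file name):
# print(count_typical_functions("sample.doc"))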
3.3 Obfuscation
Attackers also obfuscate malicious macros. Method 1 replaces class names, function names, and so on with random strings. The random strings tend to be more than 20 characters long. Method 2 encodes and decodes strings with ASCII codes. VBA macros provide the AscB function and the ChrB function. The AscB function encodes characters to ASCII codes, and the ChrB function decodes ASCII codes to characters. Method 3 encodes and decodes characters by XOR operation. Method 4 divides a string into characters. The divided characters are assigned to variables, and by adding those variables together, the original string is restored. Method 5 uses reflection functions which execute strings as instructions. These strings contain function names, class names and method names. Attackers often conceal malicious functions with these techniques.
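To illustrate what Methods 2 and 4 produce, the following Python sketch generates obfuscated VBA expressions of the kind described above. It is an illustration only, not a part of the proposed method, and the URL is a hypothetical placeholder.

def obfuscate_ascii(s):
    """Method 2 style: rebuild a string from character codes (Chr in VBA)."""
    return " & ".join("Chr({})".format(ord(c)) for c in s)

def obfuscate_split(s, chunk=3):
    """Method 4 style: divide a string into short literals joined at run time."""
    parts = [s[i:i + chunk] for i in range(0, len(s), chunk)]
    return " & ".join('"{}"'.format(p) for p in parts)

url = "http://example.com/payload.exe"        # hypothetical download target
print(obfuscate_ascii("WScript.Shell"))       # Chr(87) & Chr(83) & Chr(99) & ...
print(obfuscate_split(url))                   # "htt" & "p:/" & "/ex" & ...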
4. NLP Technique

4.1 Bag-of-Words
BoW is a basic method to extract feature vectors from a document. BoW represents the frequency of each word in a document, and extracts a matrix from the documents. In this matrix, each row corresponds to a document, and each column corresponds to a unique word in the documents. This method does not consider word order or meaning. In this method, the number of unique words corresponds to the dimension of the matrix. Thus, as the number of unique words increases, the number of matrix dimensions increases. Therefore, methods to adjust the number of dimensions are required. To adjust the number of dimensions, important words have to be selected.
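A minimal BoW sketch with scikit-learn (Ref. [17]) is shown below. The word lists are hypothetical examples, and the paper's own tokenization and vocabulary handling may differ; max_features caps the dimensionality in the way described above.

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "CreateObject Shell download execute",     # words from a hypothetical malicious macro
    "Sub Worksheet_Change Range Format Sum",   # words from a hypothetical benign macro
]

vectorizer = CountVectorizer(max_features=1000)  # limit the number of dimensions
X = vectorizer.fit_transform(corpus)             # rows: documents, columns: unique words
print(vectorizer.get_feature_names_out())
print(X.toarray())                               # word frequencies per document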
4.2 Term Frequency-Inverse Document Frequency
Term Frequency-Inverse Document Frequency (TFIDF) is one of the most popular methods for selecting important words. We will introduce how a TFIDF value is calculated.

\[ \mathit{TFIDF}_{i,j} = \mathit{frequency}_{i,j} \times \log_2 \frac{D}{\mathit{documentfrequency}_i} \]

The frequency_{i,j} (TF) is the frequency of word i in document j. The documentfrequency_i is the number of documents in which word i appears. The IDF is the logarithm of the value obtained by dividing D (the total number of documents) by documentfrequency_i. The TFIDF value is the product of TF and IDF. Finally, the TFIDF values are normalized. In this model, if a word appears rarely in the entire corpus but frequently in a document, its TFIDF value increases.
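The following sketch computes TFIDF values directly as defined above (log base 2, followed by L2 normalization of each document vector). It is an illustration rather than the paper's implementation; note that scikit-learn's TfidfVectorizer uses a different log base and smoothing by default.

import math
from collections import Counter

def tfidf_matrix(documents):
    """documents: list of word lists. Returns the vocabulary and the TFIDF rows."""
    D = len(documents)
    vocab = sorted({w for doc in documents for w in doc})
    df = {w: sum(1 for doc in documents if w in doc) for w in vocab}   # document frequency
    rows = []
    for doc in documents:
        tf = Counter(doc)
        row = [tf[w] * math.log2(D / df[w]) for w in vocab]            # TF x IDF
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])                           # normalize the vector
    return vocab, rows

# Hypothetical word lists extracted from two macros:
vocab, rows = tfidf_matrix([
    ["CreateObject", "Shell", "Shell", "download"],
    ["Sub", "Range", "Sum", "download"],
])
print(vocab)
print(rows)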
Fig. 2 The classification accuracy of the methods with BoW and SVM.
Fig. 4 The classification accuracy of the SVM with BoW and Doc2vec.
Fig. 6 The classification accuracy of the MLP with BoW and Doc2vec.

6.5 Result
Experiment 1
Figure 2 shows the classification accuracy of the 3 methods with BoW and SVM. The horizontal axis corresponds to the dimensions, and the vertical axis corresponds to the F-measure. As a result, extracting words from malicious macros is effective. Therefore, our method uses this method to construct a corpus in the following experiments.
Experiment 2
Figure 3 shows the classification accuracy of the replaced method and the unreplaced method with BoW and SVM. The horizontal axis corresponds to the dimensions, and the vertical axis corresponds to the F-measure. As a result, replacing these patterns with single words is effective. Therefore, our method replaces these patterns with single words in the following experiments.
Experiment 3
Figure 4, Fig. 5, and Fig. 6 show the classification accuracy of the combinations of the language models and classifiers. The horizontal axis corresponds to the dimensions, and the vertical axis corresponds to the F-measure. Overall, the F-measure of Doc2vec is higher than that of BoW in Fig. 4 and Fig. 6. In contrast, the F-measure of BoW is higher than that of Doc2vec in Fig. 5. In Fig. 6, the best F-measure achieves 0.93. Moreover, the F-measure with MLP and Doc2vec is stable. The combination of SVM and Doc2vec is also stable and quite good. Therefore, we conclude the best combination is MLP and Doc2vec.
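A minimal sketch of the best-performing combination (Doc2vec features with an MLP classifier) using gensim (Ref. [16]) and scikit-learn (Ref. [17]) is shown below. The corpus handling, hyper-parameters, and the split into previous (training) and unseen (test) macros are assumptions, not the paper's exact settings.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import f1_score

def doc2vec_mlp_f1(train_words, train_labels, test_words, test_labels, dim=100):
    """train_words/test_words: lists of word lists from macro source code.
    Labels are 1 for malicious and 0 for benign. Returns the F-measure."""
    tagged = [TaggedDocument(words=w, tags=[i]) for i, w in enumerate(train_words)]
    d2v = Doc2Vec(tagged, vector_size=dim, min_count=1, epochs=30)

    X_train = [d2v.dv[i] for i in range(len(train_words))]   # gensim 4.x: trained doc vectors
    X_test = [d2v.infer_vector(w) for w in test_words]       # vectors for unseen macros

    clf = MLPClassifier(max_iter=500).fit(X_train, train_labels)
    return f1_score(test_labels, clf.predict(X_test))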
Experiment 4
Figure 7 shows the classification accuracy of the methods with TF and TFIDF. The horizontal axis corresponds to the dimensions, and the vertical axis corresponds to the F-measure. As a result, TF is more effective than TFIDF. The best F-
To evaluate their method in practical use, more samples are required.

Kim et al. proposed a method to detect malicious VBA macros with feature vectors from obfuscated source code [4]. Their method might not be able to detect non-obfuscated malicious VBA macros. Our method uses not only features in obfuscated macros, but also other features. Moreover, they conducted cross-validation with thousands of samples to evaluate their method. The details of the samples are not described in their paper. However, in reality, their method can only use previous samples for training. Hence, cross-validation is not appropriate in this case. Their method might not detect unseen malicious macros which contain new malware families. We evaluated our method with thousands of samples, and used only previous samples for training. We described the details of our samples, and showed that our method could detect unseen malicious macros which contain new malware families.

8. Conclusion

In this paper, we propose a method to detect unseen malicious macros themselves. To investigate the source code, we focus on NLP techniques. Our method divides the source code into words, and extracts feature vectors from the corpus with BoW and Doc2vec. Our method selects important words with TF and TFIDF to improve accuracy. Then, our method uses basic classifiers to detect unseen macros. Experimental results show that our method can detect 89% of new malware families, and the best F-measure achieves 0.93. Doc2vec represents feature vectors effectively, and the best combination of NLP techniques and classifiers is Doc2vec and MLP. Contrary to our expectations, TF is more effective than TFIDF in classifying macros.

In this paper, we used both malicious and benign samples obtained from Virus Total. We assumed these samples represented all VBA macros on the Internet. We selected all VBA macros whose file extensions were doc, docx, xls, xlsx, ppt, and pptx. Hence, we believe these malicious samples mostly represent the population of malware samples. More benign samples, however, might have to be collected to represent the population. For future work, we should evaluate our method with other samples. To derive more reliable results, samples should be obtained from other sources. The latest malware samples should also be investigated. However, as we mentioned previously, these latest samples should be analyzed on a long-term basis, and it seems to take more time to label them precisely. Developing a practical detection system is another task for future work.

References
[1] Wolf in sheep's clothing: A SophosLabs investigation into delivering malware via VBA, available from https://nakedsecurity.sophos.com/2017/05/31/wolf-in-sheeps-clothing-a-sophoslabs-investigation-into-delivering-malware-via-vba/.
[2] Nissim, N., Cohen, A. and Elovici, Y.: ALDOCX: Detection of Unknown Malicious Microsoft Office Documents Using Designated Active Learning Methods Based on New Structural Feature Extraction Methodology, IEEE Trans. Information Forensics and Security, Vol.12, No.3, pp.631–646 (2017).
[3] Bearden, R. and Lo, D.C.-T.: Automated Microsoft Office macro malware detection using machine learning, 2017 IEEE International Conference on Big Data, BigData 2017, pp.4448–4452, IEEE (2017) (online), available from http://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=8241556.
[4] Kim, S., Hong, S., Oh, J. and Lee, H.: Obfuscated VBA Macro Detection Using Machine Learning, DSN, pp.490–501, IEEE Computer Society (2018) (online), available from http://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=8415926.
[5] Miura, H., Mimura, M. and Tanaka, H.: Discovering New Malware Families Using a Linguistic-Based Macros Detection Method, 2018 Sixth International Symposium on Computing and Networking Workshops (CANDARW), pp.431–437 (online), DOI: 10.1109/CANDARW.2018.00085 (2018).
[6] Miura, H., Mimura, M. and Tanaka, H.: Macros Finder: Do You Remember LOVELETTER?, Proc. Information Security Practice and Experience - 14th International Conference, ISPEC 2018, pp.3–18 (online), DOI: 10.1007/978-3-319-99807-7_1 (2018).
[7] Naser, A., Hjouj Btoush, M. and Hadi, A.: Analyzing and Detecting Malicious Content: DOCX Files, International Journal of Computer Science and Information Security (IJCSIS), Vol.14, pp.404–412 (2016).
[8] Otsubo, Y., Mimura, M. and Tanaka, H.: O-checker: Detection of Malicious Documents through Deviation from File Format Specifications, Black Hat USA (2016).
[9] Boldewin, F.: Analyzing MSOffice malware with OfficeMalScanner, 30th July (2009).
[10] Mimura, M., Otsubo, Y. and Tanaka, H.: Evaluation of a Brute Forcing Tool that Extracts the RAT from a Malicious Document File, AsiaJCIS, pp.147–154, IEEE Computer Society (2016) (online), available from http://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=7781470.
[11] Corona, I., Maiorca, D., Ariu, D. and Giacinto, G.: Lux0R: Detection of Malicious PDF-embedded JavaScript code through Discriminant Analysis of API References, Proc. 2014 Workshop on Artificial Intelligent and Security Workshop, AISec 2014, Dimitrakakis, C., Mitrokotsa, A., Rubinstein, B.I.P. and Ahn, G.-J. (Eds.), pp.47–57, ACM (2014) (online), available from http://dl.acm.org/citation.cfm?id=2666652.
[12] Liu, D., Wang, H. and Stavrou, A.: Detecting Malicious Javascript in PDF through Document Instrumentation, DSN, pp.100–111, IEEE Computer Society (2014) (online), available from http://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=6900116; http://www.computer.org/csdl/proceedings/dsn/2014/2233/00/index.html.
[13] Le, Q.V. and Mikolov, T.: Distributed Representations of Sentences and Documents, Proc. 31st International Conference on Machine Learning, ICML 2014, pp.1188–1196 (2014) (online), available from http://jmlr.org/proceedings/papers/v32/le14.html.
[14] Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S. and Dean, J.: Distributed Representations of Words and Phrases and their Compositionality, Advances in Neural Information Processing Systems 26: Proc. 27th Annual Conference on Neural Information Processing Systems 2013, pp.3111–3119 (2013) (online), available from http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.
[15] olevba, available from https://github.com/decalage2/oletools/wiki/olevba.
[16] gensim: topic modelling for humans, available from https://radimrehurek.com/gensim/.
[17] scikit-learn: Machine Learning in Python, available from https://scikit-learn.org/.
[18] Virus Total, available from https://www.virustotal.com/.
[19] Windows Defender Antivirus, available from https://www.microsoft.com/en-us/windows/windows-defender/.