A Study of Polymorphic Virus Detection: November 2018
1 author:
Vinh T. Nguyen
Texas Tech University
All content following this page was uploaded by Vinh T. Nguyen on 30 November 2018.
Abstract
Traditional viruses were computer programs with a static structure and very limited functionality. Once such a virus has been identified for the first time, antivirus (AV) software uses its structure to detect other viruses with similar patterns. However, modern viruses are smart enough to reconfigure themselves and even change the pattern of their functionality, making it hard for AV software to detect them. A polymorphic virus is a complicated computer virus that affects data types and functions, making it difficult to inspect its internal structure. In this paper, we conduct a study of the polymorphic virus to answer three research questions: (1) What are the general techniques employed by these viruses to exhibit polymorphism? (2) What is the state of the art in detecting polymorphic viruses? and (3) What should be done to help antivirus software detect these viruses? The results of this study may provide a good source of knowledge for researchers of polymorphic viruses and for antivirus software companies seeking an overall picture of this kind of virus, and may thus help them arrive at a suitable solution to the problem.

Figure 1: A little poem message annoying users
1 Introduction

With the advent of science and technology, the computer has become one of the most advanced devices of recent centuries, helping humans perform sophisticated work and store data. It is used in our daily activities in the form of desktop computers, laptops, tablets, smartphones and other hand-held devices. In the early days, computers were mostly used to speed up calculations through a set of instructions, with limited storage capacity. Later on, this architecture was expanded to store data on storage devices such as floppy disks, optical disks, hard disks, memory sticks and so on. In competitive markets, this data may contain highly sensitive information, so it has become one of the favorite targets of attackers, and a great many malicious programs have been written to target it. These malicious programs are known by many different names, such as virus, malware, botnet and trojan, and are written for different purposes (e.g., for fun, for evil, or even for good). They often operate by inserting or attaching themselves to another host program.

One typical harmless virus was Elk Cloner [Spafford et al. 1989], written around 1982 by Richard Skrenta, a 15-year-old high school student, which displayed a little poem on the screen. It did not damage any resources on the computer but annoyed people with the message shown in Figure 1. This virus was able to spread to and infect other Apple II systems.

Inspired by the study of biological evolution and self-reproduction, John von Neumann [Von Neumann and Burks 1996] designed the first self-replicating computer programs known in history. This work can be considered the foundation of many modern viruses.

A virus can penetrate host computers in many different ways, for example through email, text message attachments, social links, free apps, fun images, and audio or video files. Once triggered, it may stay dormant and infect other computers in the network. To avoid being detected, virus authors have used various techniques to conceal the code. One traditional method avoided detection by not modifying the "last modified" date of the host file when it was infected. Other viruses, for example the Chernobyl virus [Christodorescu and Jha 2006], overwrote the unused areas of executable files with malicious code, which kept the size of the infected files unchanged. A more advanced example, Conficker [Porras et al. 2009], terminated the tasks associated with the antivirus software before it could be detected.

Figure 2: Anti-virus signature-based detection

As operating systems were updated so that files could no longer be modified or processes killed without proper authorization, virus authors had to use other techniques to hide their code. The first technique was known as self-modification [Anckaert et al. 2006]. It was developed to counter antivirus (AV) software that scans for viruses by signature, as depicted in Figure 2. Basically, the AV maintains a database that contains a list of signatures for every detected virus. When it scans a file, it compares the file's signature with its signature database. Once a string is matched, the file is considered infected and can be deleted, locked or cleaned (by removing the signature). To avoid detection, the virus modifies itself so that it has a new signature in every infected file, which can be shown as follows:
    repeat N times {
        increase A by one
        do something with A
        when STATE has to switch {
            replace the opcode "increase" above
            with the opcode to decrease,
            or vice versa
        }
    }

In this way, virus authors can create a practically unlimited number of signatures.

Another method of avoiding signature detection is encryption. This technique uses a simple encryption method to encipher the body of the malicious code. Each encryption key produces a different ciphertext, so the virus can replicate itself to many different files by modifying only the encryption key. Each infected file contains the encrypted malicious code, a decryption module and the encryption key. The unique encrypted malicious code results in a different signature, which makes it difficult for AV software to detect. The main drawback of this technique is that the decryption module remains constant across all infected files, which gives AV software a possible way to detect it.

To overcome the limitation of an encryption method with a constant decryption module, a new technique was developed that changes the decryption module from static to dynamic, that is, the module is modified in each infection. This method is called polymorphic code [Torrubia-Saez 2003]. The polymorphic virus has become one of the most challenging tasks for AV software, since it is a self-encrypted virus that is able to duplicate itself by creating slightly modified versions of itself.

A more advanced technique is metamorphic code [Borello and Mé 2008], in which the virus completely rewrites itself on every execution. However, this method is extremely expensive because it requires a metamorphic engine, which makes it impractical.

Hence, in this study, we focus on understanding the polymorphic virus by addressing the following research questions:

• Q1: What are the general techniques employed by these viruses in order to exhibit polymorphism?
• Q2: What is the state of the art in detecting polymorphic viruses?
• Q3: What should be done in order to help antivirus software detect these viruses?

The rest of the paper is organized as follows: Section 2 reviews the state of the art in detecting polymorphic viruses. Section 3 presents the general techniques employed by these viruses in order to exhibit polymorphism. Section 4 shows potential approaches to help antivirus software detect these viruses. Section 5 concludes our paper with recommendations.

2 Literature Review

Typically, to understand the pattern and behavior of a malicious program, two general approaches are used in analysis: (1) static analysis and (2) dynamic analysis. Static analysis involves analyzing binary signatures of the malware without executing it, whereas dynamic analysis observes the behavior of the running malicious code in a controlled environment.

2.1 Static Analysis

Signature-based approach: Signature detection [Griffin et al. 2009] is the simplest method and the most widely used for traditional malware detection. This method constructs a database that contains the signatures of all known malware. When analyzing new code, it compares the signature of the analyzed file with its database; if a match is found, the analyzed file is considered a virus. This approach is fast and has a high detection rate for known malware, but the database needs to be updated with every new signature. Although this technique is old, it was used in the early days of polymorphic virus detection, when investigators and researchers analyzed viruses manually, one by one and line by line, to detect various sequences of code [Bondarenko and Shterlayev 2006]. As the number of viruses has been increasing so fast, this technique quickly becomes time-consuming, expensive and impractical.

System call analysis: Sung et al. [Sung et al. 2004] proposed the Static Analyzer of Vicious Executables (SAVE) to detect malware, focusing mostly on polymorphic and metamorphic viruses that run on the Windows operating system. This method works on the assumption that all variants of a malware share a common core signature, which is a combination of several features of the code. Their method involves two critical steps. First, the Portable Executable (PE) file is decompressed and passed through a parser, which produces a Windows API calling sequence. Second, this API sequence is compared against the signature database, and a similarity measure is used to reach a verdict on the analyzed file. If the similarity is greater than a certain threshold, the detection is triggered.

Control-flow graph: Graphs are also used in static analysis [Christodorescu and Jha 2006] and [Bonfante et al. 2007], where a set of control flow graphs (CFGs) is constructed, reduced where possible, and used as a signature database. This method works on the assumption that most mutation engines do not modify the control flow graph of the malware. Detection is carried out by comparing the sub-CFGs of the malicious file against the signature database to find out whether any sub-CFG matches. However, this method does not work when analyzing a metamorphic virus (for example Zmist [Szor and Ferrie 2001]), because such a virus can change its own code on each execution or change the branching structure of the flow graph.

Model checking: This method assumes that systems have a finite number of states or may be reduced to finite state by abstraction. Chaumette et al. [Chaumette et al. 2011] used context-free grammars as viral signatures and designed a process to extract a simple virus signature. This method was based on two assumptions. First, most mutation engines generate code belonging to a language of low complexity, that is, either a regular language or a context-free language. Second, the mutation engine has to be embedded inside the self-replicating malware, hence it is feasible to extract the grammar of the mutation engine via static analysis. However, this method is very time-consuming. Another study was presented by Gerald R. Thompson and Lori A. Flynn [Thompson and Flynn 2007]. They mapped the hierarchical structure of a program to a context-free grammar, normalized the grammar, and finally used a fast check for homomorphism between the normalized grammars. This technique is resilient to polymorphism that reorders, rewrites, inserts or removes instructions. The approach did not address encrypted files, but it can be applied after a file is decrypted if the unencrypted virus is suspected to be polymorphic.

Data-flow analysis: This method gathers information about the possible set of values of the objects and variables involved in the specimen. Agrawal et al. [Agrawal et al. 2012] proposed a Malware Abstraction Analysis (MAA) method. They used two stages to derive the semantic signature of a binary instance. First, all functions are analyzed, abstracting away all unnecessary control flow artifacts from their flow graphs. Second, all local, function-level signatures are combined into a single, global signature while abstracting away all call- and return-specific artifacts. This method is resistant to large-scale, global code transformations.

Machine learning analysis: In recent years, machine learning has gained popularity in many fields, including security. Moskovitch et al. [Moskovitch et al. 2008] proposed a technique that monitors a small set of features that are sufficient for detecting malware without sacrificing accuracy. The results of the study showed that, using only 20 features, the mean detection accuracy was greater than 90 percent, and for specific unknown worms the accuracy exceeded 99 percent, while maintaining a low false positive rate. The advantage of machine learning techniques is that they not only detect known malware but also serve as a knowledge base for detecting new malware. Similar studies have used other models such as Naive Bayes [Alazab et al. 2011], Decision Tree and Neural Network [Moskovitch et al. 2008]. Although this technique is practical, it may not replace the standard detection methods but rather act as an add-on feature, because machine learning techniques are computationally intensive and may not be suitable for end users.

2.2 Dynamic Analysis

Trevor Yann and Oleg Petrovsky [Yann and Petrovsky 2006] proposed an architecture for detecting polymorphic viruses that includes three components: (1) an emulator that emulates a selected number of instructions of the computer program, (2) an operational code analyzer that analyzes a plurality of registers/flags accessed during the emulated execution of the instructions, and (3) a heuristic analyzer that determines the probability that the computer program contains viral code, based on a heuristic analysis of the register/flag state information supplied by the operational code analyzer.

Polychronakis et al. [Polychronakis et al. 2006] presented a heuristic detection method that scans network traffic streams for the presence of polymorphic shellcode. The algorithm relies on a full-blown IA-32 CPU emulator, which makes the detector immune to runtime evasion techniques such as self-modifying code. Each incoming request is executed in a virtual environment. Their algorithm focuses on identifying the decryption process that takes place during the initial execution steps of a polymorphic shellcode. The results of the study showed that the proposed approach is more robust to obfuscation techniques such as self-modification. One limitation of this approach is that it detects only polymorphic shellcode that decrypts its body before executing its actual payload; it does not capture shellcode that performs no self-modification.

Rogers et al. [Rogers et al. 2012] proposed an apparatus to detect malicious code that uses calls to an operating system to damage computer systems. This method creates an artificial memory region, which may span one or more components of the operating system. The malicious file is executed, and the method tries to detect whether the executable code attempts to access the artificial memory region. The method may comprise determining an operating system call that the emulated code attempted to access, and monitoring the operating system call to determine whether the code is viral.

Another apparatus was presented by Muttik and Long [Muttik and Long 2005], who patched additional program instructions into an emulator for detecting suspect code. During operation, a first emulator extension is loaded into the emulator, and then the suspect code is loaded into an emulator buffer within a data space of a computer system. The suspect code is executed in the first emulator extension. During this emulation, the system identifies whether the suspect code is likely to exhibit malicious behavior.

In another work, Muttik [Muttik 2004] presented an apparatus for detecting malicious software by analyzing the patterns of system calls generated during emulation. The malicious file is executed in an isolated environment, and its pattern of system calls is recorded and compared against a database containing suspect patterns of system calls. Based on the result of the comparison, the system identifies whether the software is likely to exhibit malicious behavior.

Stepan [Stepan 2005] proposed a method to detect malware by disassembling the malicious code dynamically and then compiling this code for the host CPU, so that the resulting executable can be run safely on the host CPU. The code obtained can then be compared with the original code. This method increases the analysis speed significantly.

3 The polymorphic virus

The first polymorphic virus was written by Mark Washburn in 1990 [Szor 2005]. It was known as the V2PX virus, or as 1260 because of its length (1,260 bytes). Inspired by Ralf Burger's publication and deriving his work from the original Vienna virus, Washburn wished to show the antiviral community why identification string scanners did not work in all cases. An infected file grows by 1,260 bytes and is encrypted, and the encryption key changes with each infection. V2PX was not memory resident; it infected *.COM files in the current or PATH directories upon execution. Two sliding keys were used to decrypt the virus body, but more importantly, junk instructions were inserted into the decryptor. These instructions did nothing useful; they worked as camouflage for the code. Depending on the number of inserted junk instructions, the decryptor can be shorter or longer. Furthermore, the groups of instructions within the decryptor can be permuted in any order, so the structure of the decryptor can change. Figure 3 shows an example of a decryptor. It can be seen from Figure 3 that a set of junk instructions is inserted in each group of instructions (INC SI, CLC, NOP, and other do-nothing instructions).

Figure 3: An Example Decryptor of 1260

The next milestone in the development of polymorphic viruses was the advent of the Mutation Engine (MtE) [Bontchev 1992], which was
written by the Bulgarian Dark Avenger. The idea of the mutation engine was based on modular development. The concept of MtE was to make a function call to the MtE function and pass control parameters in predefined registers. The MtE then builds a polymorphic shell around the simple virus inside it. When a virus uses the engine to write itself to a file, the MtE encryptor modifies the virus code so that it looks like random garbage. The decryptor ungarbles this code once it is executed. The decryptor is the one part of the virus that remains unencrypted. When an infected file is run, the decryptor first gains control of the system and then decrypts both the virus body and the MtE. It then transfers control of the system to the virus, which in turn locates a new file to infect. The parameters to the MtE engine include the following:

• A work segment
• A pointer to the code to encrypt
• Length of the virus body
• Base of the decryptor
• Entry-point address of the host
• Target location of encrypted code
• Size of decryptor (tiny, small, medium, or large)
• Bit field of registers not to use

• Level 1: To generate a polymorphic virus, a scheme is chosen from a set of encryption/decryption schemes. An instance of the virus will have one of these schemes in plain text, as shown in Figure 5. The public key for this encryption can be distributed to many parties to encrypt the message. This simple scheme is called "semi-polymorphic".

Figure 5: A simple semi-polymorphic virus method

• Level 2: The virus decryption routine contains one or several constant instructions, and the rest is changeable. In Figure 6, the algorithm uses the variables A and B but not the variable C, allowing C to be changed endlessly.
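
From the detection side, the difference between these two levels can be sketched in a few lines of Python. This is only an illustrative sketch under assumptions that are not part of the original study: the byte values are invented and do not correspond to any real virus, and the names scan_exact, find_wildcard, FIXED_DECRYPTORS and VARIABLE_DECRYPTOR are hypothetical. A Level 1 virus draws its decryptor from a fixed set, so one exact signature per scheme may be enough, whereas a Level 2 decryptor keeps only a few constant instructions, so a scanner would have to match those constants and skip the changeable positions.

```python
from typing import Optional, Sequence

# All byte values below are made up for illustration; they are not
# signatures of any real virus.

Pattern = Sequence[Optional[int]]  # None marks a byte that may vary


def scan_exact(data: bytes, signatures: Sequence[bytes]) -> bool:
    """Level 1 view: report a hit if any known fixed decryptor appears."""
    return any(sig in data for sig in signatures)


def find_wildcard(data: bytes, pattern: Pattern) -> int:
    """Level 2 view: return the first offset where every constant byte of
    `pattern` matches, ignoring wildcard positions; -1 if there is none."""
    n = len(pattern)
    for start in range(len(data) - n + 1):
        if all(p is None or data[start + i] == p for i, p in enumerate(pattern)):
            return start
    return -1


# Two hypothetical fixed decryptors (Level 1).
FIXED_DECRYPTORS = [bytes([0x10, 0x20, 0x30]), bytes([0x11, 0x21, 0x31])]

# One hypothetical decryptor with a changeable byte in the middle (Level 2).
VARIABLE_DECRYPTOR: Pattern = [0xAA, None, 0xBB, 0xCC]

# Two samples that differ only in the changeable byte.
sample_a = bytes([0x90, 0x90, 0xAA, 0x01, 0xBB, 0xCC, 0x00])
sample_b = bytes([0x90, 0x90, 0xAA, 0x7F, 0xBB, 0xCC, 0x00])

print(find_wildcard(sample_a, VARIABLE_DECRYPTOR))
print(find_wildcard(sample_b, VARIABLE_DECRYPTOR))
```

Both samples are matched at the same offset even though their changeable byte differs, while an exact string signature taken from one sample would miss the other. The sketch does not attempt to handle inserted junk instructions or reordered instruction groups.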