Defeating Polymorphism Beyond Emulation
40 VIRUS BULLETIN CONFERENCE OCTOBER 2005 ©2005 Virus Bulletin Ltd. No part of this reprint may be
reproduced, stored in a retrieval system, or transmitted in any form without the prior written permission of the publishers.
DEFEATING POLYMORPHISM: BEYOND EMULATION STEPAN
Of course, the code could be executed in a controlled environment, such as a virtual machine, but a general-purpose virtual machine is very complex software. Including one in an AV product would require software emulation of the full system environment, from device drivers to system APIs, which is simply too much overhead.

None of the methods described above is able to detect new malware in a generic way. Their purpose is just to decrypt polymorphic viruses, so that signature-based detection can be used. While it is possible to develop dedicated routines that are able to detect entire malware families (and often new variants as well), writing a routine that is able to analyse an arbitrary program and determine malicious behaviour is not feasible.

Emulation solves all of the above problems. Potentially malicious code runs in a controlled, simulated environment. Hardware resources such as CPU registers are modelled using software data structures. The behaviour of each instruction is reproduced by a set of software routines designed to update the corresponding data structures, in the same way the instructions would update the hardware resources when executed on a real CPU. Each instruction is first decoded, in order to find the instruction type, its length, the operands that need to be updated, etc. After this, the appropriate emulation routine is called to update the data structure describing the hardware resources. The address of the next instruction is either obtained as a result of instruction decoding or computed by the emulation routine (in the case of branch instructions).

The emulation process usually starts at the program's entry point, and instructions are emulated sequentially until a malware signature is found, the emulator is able to conclude that the program is not malicious, or the emulator's resources are exhausted. In addition, the emulator has to decide when to scan for malware signatures and what data to scan, call the scanner, and collect and analyse data obtained during emulation, either for behaviour-based heuristic detection or for the purpose of deciding that the program is not malicious.

Emulation can be used to decrypt any encrypted code, regardless of the complexity of the encryption algorithm, provided that the decryption code is available and the emulator is given enough resources to complete the decryption. Providing resources such as memory is not very difficult, as the requirement is comparable with that of the analysed program (the overhead caused by the emulator's internal data structures is typically negligible). However, emulating code is significantly slower than running the code on a CPU that can execute it natively. This limitation is impossible to overcome, because for each emulated instruction an emulator has to execute hundreds of instructions to decode it, update the internal data structures, decide whether scanning is needed, decide whether more instructions should be emulated, find the address of the next instruction, etc. Typically, emulation is hundreds of times slower than native execution.

An emulator in an AV engine is required to analyse any given file in a finite time, during which it must determine whether the file is malicious. When the maximum allowed time for a file has elapsed and no malware signatures have been detected, the emulator must stop the analysis and conclude that the file is not malicious. It is always possible that the maximum time limit was set too short for a particular piece of malware to be detected. On the other hand, increasing this time limit will degrade the emulator's average speed. This happens because there will always be a small percentage of clean files for which the emulator will never be able to determine that they are clean, no matter how long it analyses them. Of course, the time limit can be adjusted dynamically: the emulator could increase it if suspicious behaviour is detected, or decrease it otherwise. However, there are instances in which legitimate programs are encrypted with the same encryption engines as certain viruses, or in which viruses use code that looks benign, etc. Even if the time limit is adjusted dynamically, there has to be a hard limit, to prevent the emulator from analysing a file in an infinite loop. This means that it will always be possible for a virus writer to determine what this limit is for a particular AV engine and to write a virus that would need to be emulated for longer than that in order to be detected. Viruses such as Win32/Coke, KME-based viruses, etc. would take unreasonably long to emulate, but they would still decrypt and replicate on the real machine in a reasonable time, because native execution is typically hundreds of times faster than emulation.

3. DYNAMIC TRANSLATION

The 'dynamic translation' (DT) method described below offers the same flexibility as emulation, while improving performance significantly. It relies on dynamically disassembling the code to be analysed and translating it into functionally equivalent code that is safe to execute on the host machine. The executable code obtained as a result of the translation is persisted; if the code is executed inside a loop, the persisted code can, in most cases, be executed directly, without requiring retranslation.

Disassembling and translating an instruction requires a computational effort comparable to that of emulating it; executing the resulting code is typically slower than executing the original instruction, but much faster than emulating it. If a code sequence is executed in a loop, the code will be translated and executed at the first loop iteration, and for all subsequent iterations the persisted code obtained at the first iteration will be executed. Thus, the method eliminates redundant analysis of repeating code sequences. Compared to emulation, the time required to complete the first loop iteration would be approximately the same, while the subsequent iterations will take considerably less time.

3.1. Partitioning the code into blocks

One of the problems that the implementation of the DT engine has to address is determining whether translated code is available for any given instruction and, if so, locating the corresponding code. One possible solution is to maintain a table of the virtual addresses of translated instructions and the addresses of the corresponding executable code. However, searching for a virtual address in this table for each processed instruction is computationally very expensive, negating the speed advantage of executing translated code. A much more efficient way of solving this issue would be to partition the original code into blocks of instructions and store only one table entry for each block. This way, the table would have significantly fewer entries, and searching would need to be performed once per block rather than once per instruction. Dividing the original code into blocks cannot be done in an arbitrary way; blocks need to have some specific properties, as described below, that limit the size of each block. On the
other hand, the bigger the blocks are, the more efficient the storage and searching will be.

For the rest of this paper, a 'basic block' (BB) will be defined as a contiguous block of code having a single entry point at the beginning of the block and a single exit point at the end of the block. If the code within such a block is executed via a call instruction to the beginning address of the block, all the instructions in the block will be executed. A single instruction is needed, at the end of the block, to return control to the caller. As a consequence, any basic block of original code will contain at most one jump instruction. Basic blocks that don't contain any jump instructions are valid, but suboptimal. They could be created as a result of splitting existing blocks, or for other practical reasons such as the DT engine having insufficient resources to translate a bigger block, etc. Such blocks don't need to be treated differently, as we can consider that they end with a virtual unconditional jump to the following instruction.

Each discovered BB will be described by a set of properties, such as:
• Block boundaries: the address of the first instruction in the BB and the address immediately following the last instruction in the BB will be used to delimit the block; these are linear addresses in the address space of the original code.
• The executable code obtained as a result of translating the original code in the block.
• Miscellaneous flags, indicating whether the block was translated, whether it was scanned for malware signatures, whether it has any known successor blocks, etc.

ii The current address is found inside the address range of an existing BB. According to our definition, a BB must have a single entry point; therefore this block must be split. The existing block will be modified so that its address range ends at the current address, and a new BB will be created, starting at the current address. If the split block was already translated, the existing code will not be truncated, but the code starting at the current address will be retranslated for the newly created block.
iii The current address is not found inside the address range of any existing BB. In this case, a new BB will be created, starting at the current address.

If the previous BB ended with an immediate jump instruction (one whose destination address is constant), the current BB will always be a successor of the previous BB. This information can be stored, in order to avoid unnecessary searching in the BB address ranges in case these blocks are inside a loop. If the previous jump instruction was unconditional, the current BB is the only possible successor; in the case of a conditional jump instruction, there can be at most two different successor blocks. If a block ends with a computed jump instruction, the destination address could be different each time the block is executed, so in this case determining the successor block requires searching, as described above, in the list of existing block address ranges. A list of successor blocks determined at previous iterations could also be stored and used to speed up searching, provided the number of different possible successor blocks is reasonably small.

Delimiting the original code into basic blocks as described above and maintaining a data structure describing the blocks and the relations between them has several advantages:
• If several blocks are executed inside a loop, searching for a successor block needs to be done only once for each successor; after all the successors of a particular block have been determined, there is no need to search for a successor of that block at any subsequent loop iteration.
• The list of beginning addresses of the discovered blocks can be used as a list of 'entry points' to scan for malware signatures; each discovered BB needs to be scanned only once.
• It provides data for applying dynamic code optimizations. Since optimizing code is computationally expensive, blocks that are executed more frequently are better candidates for optimization.

Given an arbitrary virus, it is extremely difficult to determine the exact moment during the analysis when the above condition is met. This is true for both the emulation and the dynamic translation methods. One possible approach would be to detect decryptor loops and scan the code decrypted by each such loop at the end of the loop. However, this is not very easy to implement, as determining decryptor loops, loop exit criteria and the range of decrypted code are fairly complex tasks. Also, this method does not keep the number of scans to a minimum: if the virus has multiple decryption layers, or the same layer is executed multiple times (e.g. in brute-force decryption), the code will be scanned redundantly for each decryption layer. There are also polymorphic viruses for which the entire decrypted body cannot be found in memory at any time (e.g. Dark_Paranoid). In such cases, an instruction
or a small code sequence is decrypted, executed and then immediately encrypted again. For these viruses, the decrypted body can be obtained by logging each instruction the first time it is encountered and sorting the logged instructions by address. Scanning the resulting log guarantees that the entire virus code will be scanned for signatures, and that no redundant search operations are performed, because no code sequence will be scanned more than once.

It is generally preferable to extract signatures from the constant part of a polymorphic virus and not from the decryptor code; therefore the code that is used for decryption doesn't necessarily need to be scanned. In practice, however, it is typically more computationally expensive to identify the code used for decryption precisely and exclude it from scanning than it is to scan it.

Another difficulty comes from the fact that one cannot predict the location in the decrypted code where a signature might be present. Therefore, scanning a chunk of code without any knowledge about the code structure requires searching for signatures starting at up to N - M + 1 offsets in the given code chunk, where N is the size of the chunk in bytes and M is the size of the shortest signature searched.

Dividing the code into basic blocks, as described in section 3.1, provides a means to scan all of the analysed code reliably and with the minimum number of signature searches. Given the data structure describing the basic blocks, it is easy to determine, for any particular block, whether it is the first time we're analysing it (in which case we also need to scan it) or whether we have analysed it before and no scanning is needed. As part of maintaining the BB data structure, we are also required to detect when any block that has previously been analysed is being overwritten (because if this happened, the block would need to be retranslated). A block can be scanned at any time after it is discovered and before it is overwritten, and it needs to be scanned only once. If we choose, for instance, to use only signatures starting at a BB boundary, the maximum number of scans needed will be the number of unique basic blocks in the code. Of course, a signature could span multiple blocks, in which case all of these blocks need to be discovered and decrypted so that the signature can be detected. If some of the blocks are overwritten before a signature is detected, the code must be logged as described above.

3.3. Translating code

When arbitrary program code is analysed by an AV engine, not only must the code be considered unsafe, but it could also be the case that the code is compiled to run on a different hardware platform than the AV engine. Even if the code to be analysed executes natively on the same CPU as the AV engine, it might need to run in a different CPU mode (e.g. x86 real mode vs. protected mode), use a different memory mapping mode, execute under a different operating system, etc. Therefore, even if the code were safe, or we could somehow make sure the host machine cannot be damaged, it still wouldn't be possible to correctly execute arbitrary parts of the given code from within the AV engine process. It is, however, possible to translate the given code into another code sequence that is functionally equivalent to the original and that can be safely and correctly executed on the host machine. In our case, the host CPU will be used to execute the translated code directly, while in the case of emulation a virtual CPU is used instead of the real one. Other hardware resources (I/O ports, IRQ controllers, disk drives, etc.) are typically virtualized to protect the host machine from any damage.

There are multiple ways in which a code translation that meets the above criteria can be achieved:
i Translating directly from the original code to target code: each original instruction will be decoded and then an equivalent instruction or instruction sequence will be generated for the target code. This is the simplest translation method to implement, for any source and target binary languages.
ii Translating using an intermediate language (IL): each original instruction will first be translated into an intermediate code sequence, and then the intermediate code will be translated to target native code. This method is preferable when we have multiple source languages (S) and multiple target languages (T). With direct translation, we would need S × T translators in this case, while using an intermediate language we only need S + T translators (S translators from a source language to IL and T translators from IL to a target language). For example, three source and three target languages would require nine direct translators, but only six with an IL. There are other advantages as well, such as the possibility of performing code optimizations on the IL form. The IL should be platform-independent, but could be designed in a way that favours translation speed for some particular translators (those that are more frequently used). As a drawback, the IL would need to support all the possible operators, operand types and combinations of these from all the source languages, raising its degree of complexity. This could prove to be a problem, because translating to or from languages with a reduced instruction set is generally faster, per translated instruction, than translating to or from complex languages.
iii The above methods could be combined in a way that preserves the advantages of both, without any of their disadvantages. Most instructions could be translated using a fairly simple intermediate language, while the most exotic and complex ones, which would also require a complex IL, would be translated directly.

Typically, the code obtained as a result of translation will not be as efficient as the original code. This may happen for various reasons: some instructions or operand encodings from the source language might not have a 1:1 correspondence in either the IL or the target language; some hardware resources used in the source language need to be preserved, or must not be used at all, in the target language because of specific restrictions; memory mapping has to be virtualized because the source and target languages might use different mappings; etc.

It is possible to perform some optimizations at basic block level, at translation time, to improve the efficiency of the translated code. If the translation uses an IL, the best approach would be to perform the optimizations on the IL, because the algorithms involved will then only need to support this language, as opposed to all the source and target languages. The IL could also be designed to facilitate optimizations based on definition-use (def-use) chains, copy propagation, etc., while the other languages might not be very suitable for performing such optimizations. Also, the IL may be used to pass translation 'hints' from the source translator to the target translator.
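Copy propagation, one of the IL-level optimizations mentioned above, can be illustrated with a minimal Python sketch. The paper does not define its IL, so the three-address form used here (the hypothetical `Op` record holding a destination, an operator and operands) and the operator names `mov`, `load8`, `xor` and `add` are assumptions made for illustration only. Within a single basic block, each use of a value that was copied with `mov` is replaced by the original source, and a recorded copy is forgotten as soon as either its destination or its source is redefined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """One instruction of a hypothetical three-address IL (not the paper's IL)."""
    dest: str     # register or temporary written by the operation
    op: str       # 'mov' is a plain copy; any other name is an opaque operator
    args: tuple   # operand names (str) or immediate values (int)


def copy_propagate(block):
    """Copy propagation over one basic block (straight-line code, single entry)."""
    copies = {}   # dest -> source, for copies that are still valid
    out = []
    for ins in block:
        # Rewrite the operands first: they are read before ins.dest is written.
        args = tuple(copies.get(a, a) if isinstance(a, str) else a
                     for a in ins.args)
        # ins.dest is being redefined, so any copy to or from it becomes invalid.
        copies = {d: s for d, s in copies.items()
                  if d != ins.dest and s != ins.dest}
        # Record a new register-to-register copy (an immediate is not a copy).
        if ins.op == "mov" and isinstance(args[0], str) and args[0] != ins.dest:
            copies[ins.dest] = args[0]
        out.append(Op(ins.dest, ins.op, args))
    return out
```

For example, in the block `t0 = mov si; t1 = load8 t0; t1 = xor t1, 0x55; si = add si, 1; t2 = mov t0`, the load is rewritten to read `si` directly, while the final copy keeps reading `t0`, because `si` is redefined in between. The `mov` instructions themselves are left in place; removing those that become dead would be a separate pass.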
If the analysed code is linear, the code will be translated once and executed once. As the execution time is negligible compared to the translation time, translation would in this case account for most of the analysis time. For clean files that don't unpack or decrypt themselves and don't have any suspicious behaviour, there is usually no need to analyse lots of looping code; in this case, the translators must be optimized for speed. On the other hand, for polymorphic malware and the occasional clean files that contain unpacking or decryption code, most of the time will be spent analysing loops and executing code that is already translated. In this case, the speed of the code generated by the translators is more important than the translation speed. As improving the efficiency of the translated code comes at the cost of translation speed, a compromise between the two must be reached.

An example of code translation is given in Appendix A.

3.4. DT execution flow

For each file that needs to be scanned for malware, analysis consists of sequentially identifying and processing basic blocks, as defined in section 3.1. At the beginning of the analysis, the current address is initialized to the entry point of the program to be scanned. After each BB is processed, the current address is updated to the destination address of the jump instruction at the end of this block. The analysis continues with the BB starting at the new current address, until a malware signature is detected or the program is determined to be clean.

In a simplified description, after each BB is analysed, the following processing algorithm has to be performed for the next one:
1 Look in the data structure describing relationships between blocks for a known successor of the last analysed block; if a known successor exists, make this block the current BB and continue from step 6.
2 Search the block address range table for a previously discovered block starting at the current address. If such a block is found, make it the current BB and continue from step 5.
3 If the current address is found inside the address range of an existing BB, split this BB to end at the current address. Create a new BB starting at the current address, make the new block the current BB and continue from step 5.
4 If no existing block either starts at or contains the current address, create a new BB starting at the current address; this will be the current BB.
5 Update the data structure describing relationships between blocks: store the information that the current BB is a successor of the previously processed BB.
6 If the current BB was not scanned for signatures, scan for signatures starting at the current address. If a signature is found, stop the analysis and report the file as malicious; otherwise, mark the block as scanned.
7 If the current BB is already translated and the code was not overwritten since the last translation, continue from step 10.
8 If the current BB was previously translated but the original code was overwritten, discard the translated code, as it is no longer valid.
9 Translate the current BB.
10 Prepare for executing the translated code for the current BB: save hardware resources that are used by both this algorithm and the translated code and need to be preserved, such as CPU registers, etc., depending on the implementation.
11 Execute the translated code for the current BB via a call instruction; after execution is complete, control will be returned by the executed code to the caller.
12 Restore the resources saved at step 10.
13 Handle any errors that might have occurred during execution, and decide whether to continue analysing the next block or stop.

During execution, the translated code has to check, before each memory write operation, whether an existing block would be overwritten by the data being written. The check may be skipped only if it can be determined, at translation time, that this particular write operation could never overwrite an existing block. If one or more blocks are overwritten, the corresponding translated code will be marked as 'dirty', meaning that it will be discarded at step 8, when those blocks are processed again. If one of the blocks that were overwritten is the current BB and code beyond the current execution point was overwritten, execution must not be allowed to continue to the end of the current BB, as this would mean executing code that is no longer valid. In this case, the write operation is allowed to happen, and execution is then allowed to continue until the entire translated code sequence corresponding to the current original instruction has been executed. This is done because execution cannot be interrupted in the middle of the code sequence for an original instruction; otherwise, resuming execution would not be possible. After all overwritten blocks are marked as dirty, execution is interrupted and control is returned to the caller, with the current address set to the address where execution was interrupted.

Handling of exceptions such as division by 0, page faults, etc. is done in a similar way: execution is interrupted at the address of the instruction that generated the exception; if an exception handler is present, execution may continue with the exception handler code. Information is passed to the exception handler, enabling it to resume execution at the point where it was interrupted.

Obtaining the address of the original instruction corresponding to a given instruction in the translated code requires some computational effort. For the original code, the real CPU keeps an instruction pointer register, updating it after each instruction is executed. Reproducing this behaviour in the translated code would mean generating and executing code for an extra
<add instruction_pointer, instruction_code_size>
operation for each translated instruction. For simple arithmetic instructions, this would mean doubling the translated code size and execution time, which is unacceptable. Therefore, the value of the current instruction pointer will only be computed when needed: if an exception happens or the current block is being overwritten.
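The block bookkeeping behind steps 1 to 9 above can be sketched in a few dozen lines of Python. This is a minimal model under stated assumptions, not the prototype implementation: `BasicBlock` and `BlockTable` are hypothetical names, and decoding (`find_end`), signature scanning (`scan`) and native code generation (`translate`) are passed in as placeholder callables. Block start addresses are kept in a sorted list so that the block containing an address can be found by binary search, and each block remembers its known successors in a dictionary keyed by destination address, which is one possible form of the successor list described in section 3.1. `mark_overwritten` stands in for the check performed before memory writes; steps 10 to 13 (saving state, calling the translated code, restoring state, handling errors) are outside the scope of the sketch.

```python
from bisect import bisect_right, insort
from dataclasses import dataclass, field


@dataclass(eq=False)
class BasicBlock:
    start: int                   # address of the first instruction
    end: int                     # address just past the last instruction
    translated: object = None    # placeholder for the generated native code
    code_end: int = 0            # end of the range covered by `translated`
    scanned: bool = False
    dirty: bool = False          # original code overwritten since translation
    successors: dict = field(default_factory=dict)  # destination -> BasicBlock


class BlockTable:
    def __init__(self, find_end, scan, translate):
        # Hypothetical callables supplied by the caller:
        #   find_end(addr)        -> address just past the first jump after addr
        #   scan(start, end)      -> True if a signature is found
        #   translate(start, end) -> translated code for the block
        self._find_end, self._scan, self._translate = find_end, scan, translate
        self._starts = []        # sorted block start addresses
        self._blocks = {}        # start address -> BasicBlock
        self.searches = 0        # number of address-range searches performed

    def _add(self, start, end):
        bb = BasicBlock(start, end)
        insort(self._starts, start)
        self._blocks[start] = bb
        return bb

    def next_block(self, prev, addr):
        """Steps 1-5: return the BB starting at addr that follows prev."""
        if prev is not None and addr in prev.successors:
            return prev.successors[addr]          # step 1: no search needed
        self.searches += 1
        bb = self._blocks.get(addr)               # step 2: block starts at addr
        if bb is None:
            i = bisect_right(self._starts, addr)  # blocks starting before addr
            outer = self._blocks[self._starts[i - 1]] if i else None
            if outer is not None and addr < outer.end:
                bb = self._add(addr, outer.end)   # step 3: split (single entry)
                outer.end = addr                  # outer's translation is kept
            else:                                 # step 4: brand-new block
                end = self._find_end(addr)
                if i < len(self._starts):         # stop at the next known entry
                    end = min(end, self._starts[i])
                bb = self._add(addr, end)
        if prev is not None:
            prev.successors[addr] = bb            # step 5: remember the edge
        return bb

    def prepare(self, bb):
        """Steps 6-9: scan once, translate only if missing or dirty.
        Returns True if a signature was found."""
        if not bb.scanned:
            if self._scan(bb.start, bb.end):
                return True
            bb.scanned = True
        if bb.translated is None or bb.dirty:
            bb.translated = self._translate(bb.start, bb.end)
            bb.code_end = bb.end
            bb.dirty = False
        return False

    def mark_overwritten(self, lo, hi):
        """Mark every block whose code overlaps the written range [lo, hi)."""
        for bb in self._blocks.values():
            # An untruncated translation may extend past bb.end after a split.
            if bb.start < hi and lo < max(bb.end, bb.code_end):
                bb.dirty = True
```

Because splitting a block only moves its end address, never its start, a successor cached under a destination address stays valid after splits. A split leaves the outer block's existing translation untouched, as in the block-splitting case described earlier, which is why the overwrite check uses the range actually covered by the translated code rather than the block's current end.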
For this purpose, a table is used, that relates offsets of fashion as translating from a binary language. Using an
instructions within the current BB to addresses of intermediate language might be more suitable for translating
corresponding instructions, in the original code. from a non-binary language, because it could provide support
Special care has to be taken when calling high-level language for variable number of operands and operand allocation,
compiled functions from the translated code. CPU registers which are required by virtually all script languages.
that might be used by the compiler in global optimizations Using DT to analyse scripts, however, has to deal with some
must be saved at step 10 and restored before calling a specifics. For instance, a script might not be able to modify
high-level language function, in case they were changed by itself while it is being interpreted, but it could generate other
the translated code. The stack pointer and frames must also be script files dynamically and call the interpreter to run these.
preserved. The translated code must also save and restore any In this case, each particular file can be translated statically,
registers computed before the high-level language function is but the analysis is still a dynamic process, because not all of
called, whose values are still needed after the function returns. the component files are available at the beginning of the
analysis. Providing an environment for analysing scripts is
3.5. Environment more difficult than in the case of executable files. Some of
the major challenges are accurately reproducing the
In order to obtain a correct behaviour while analysing behaviour of the script interpreter and of OS commands and
programs, the DT engine must provide access to various utilities that might be called from the script. Different
hardware devices (disk drives, keyboard, mouse, network versions of operating system or script interpreter might
interface, video card, real-time clock, etc.), as well as behave differently.
software resources, such as BIOS data structures and routines
and operating system APIs. If the code provided as input to Dynamic translation can also be used to translate malware
the DT engine could be malicious, most of these resources detection routines from a platform-independent byte code into
need to be virtualized. This requires a lot of development executable code for the host CPU. This way, an AV engine can
effort, given the large number of devices and system APIs that easily be updated with new detection capabilities, without the
need to be supported. However, virtualizing the system need to actually change the engine code. The benefits
environment for a dynamic translation engine is done in provided are smaller engine code, smaller (and faster) updates
almost the same way as in the case of an emulator, so code and less testing effort required for the new routines. The new
reusing is possible if an emulator is available. detection routines will be sent as ‘data’, the same way as
signature/pattern database updates. They will be loaded and
Accessing virtualized devices, as opposed to real ones, translated to native code when the engine is loaded in
offers improved speed performance. A virtual device is in memory and then executed, as needed, with little speed
fact a data structure in memory, which can be accessed much penalty compared to native code generated by a compiler.
faster than a disk drive, for instance. Often there’s no need to
fully implement all the functionality of a device, because no The approach used to translate such routines will be slightly
existing malware would require it. Therefore, a virtual device will be less complex than a real one, making it even faster in operation.

Real devices may be used in a few cases – for instance, the current time could be obtained from the real-time clock. However, this may cause the analysed code to behave incorrectly. Because the code actually runs more slowly inside the DT engine than it would natively, time inside the DT engine should also pass proportionally more slowly. Otherwise, the analysed code could use this inconsistency to determine that it is running in a virtual environment, a check that malware uses as an anti-emulation technique.

Some malware functions correctly only if the environment is configured in a particular way. For instance, a sample might require a specific OS version, work only on a certain file system type, check for the presence of a specific file in a known location, or work only within a certain calendar date range. In these cases, it is difficult to configure the environment so that the code runs correctly, as different malware may have conflicting requirements.

4. DEVELOPING FURTHER APPLICATIONS

Decrypting code in order to provide signature-based detection for polymorphic executable file infectors is just one possible application of dynamic translation. The code to be analysed need not be CPU instruction code; it could also be platform-independent byte code, such as MSIL byte code, or a script. Translating from a non-binary language to native executable code for the host CPU can be done in a similar

different from the one used to translate files that are scanned for malware, because in this case it is already known that the detection routine code is not malicious. This allows the use of a real environment as opposed to an emulated one. Also, there is no need to scan any of the translated code and no need to check for overwritten blocks, because the code does not decrypt or otherwise modify itself. The successors of all basic blocks can be determined when the engine is loaded, which means that no successor block has to be looked up at execution time. For these reasons, the translated code obtained for detection routines will be much more efficient than the code typically obtained by translating files for the purpose of scanning.

In the case of packed executable malware, unpacking is typically needed before a signature can be matched. Using signatures extracted from packed code is not always practical, even for non-polymorphic malware: some strings or other data inside the binary might change upon each replication, causing the packed file to change in such a way that the signature no longer matches. Detection based on packed code is not possible if the packer is polymorphic, and heuristic or generic detection can be achieved only by using unpacked code.

Writing unpacking routines for all publicly available packers takes considerable development and testing effort. In some cases, writing an unpacking routine would require reverse engineering the packer, which may be subject to legal restrictions. In the absence of a dedicated unpacking routine, a packed executable could be emulated until the unpacked code is obtained. However, unpacking with
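The requirement that time inside the DT engine pass proportionally more slowly than on the host can be met in more than one way. The sketch below assumes one of them: guest-visible time is derived from the number of guest instructions executed instead of being read from the host real-time clock. The class name `VirtualClock`, the nominal 2 GHz frequency and the reuse of the 1.5 cycles-per-instruction estimate from Appendix B are illustrative assumptions, not details of the prototype.

```python
class VirtualClock:
    """Hypothetical guest clock driven by executed work, not by host time.

    Guest time advances only when guest instructions run, so a timing check
    (reading the time before and after a loop) sees roughly the duration
    native execution would have taken, however slowly the engine really runs.
    """

    def __init__(self, start_seconds: float, cpu_hz: float = 2e9,
                 cycles_per_insn: float = 1.5) -> None:
        self.start_seconds = start_seconds      # guest time when analysis starts
        self.cpu_hz = cpu_hz                    # assumed nominal guest CPU speed
        self.cycles_per_insn = cycles_per_insn  # assumed average cycles per instruction
        self.instructions = 0                   # guest instructions executed so far

    def advance(self, executed: int) -> None:
        """Called by the engine after a translated block has run."""
        self.instructions += executed

    def now(self) -> float:
        """Guest-visible time, in seconds."""
        cycles = self.instructions * self.cycles_per_insn
        return self.start_seconds + cycles / self.cpu_hz
```

For example, after `advance(4_000_000)` a clock started at 1000.0 reports about 1000.003 seconds, no matter how long the host actually took. Deriving time from instruction counts also makes repeated analyses of the same file reproducible, which scaling the host clock would not.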
APPENDIX A
The table below shows an example translation of a 16-bit
x86 code sample into 32-bit x86 target code, as generated
by the prototype implementation of the Dynamic Translation
method.
The following format was used for IL instructions: <opcode source_operand_1, source_operand_2, destination_operand>. For
example, ‘add x, y, d’ means ‘d = x + y’. Loadflags and Saveflags are not separate IL instructions; they are encoded in the binary
form of the affected IL instructions. In this particular implementation, the IL accepts simple operands, such as registers and
constants, as well as more complex operands, such as a register shifted by a constant or a sum of register, constant or shifted
operands. It is possible to design the IL to support only simple operands, in which case each IL instruction would be simpler and
faster to generate, but more IL instructions would be needed, on average, to translate each original instruction.
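To make the three-operand format concrete, here is a minimal sketch of IL instructions with simple and complex operands, plus a tiny interpreter that spells out what each instruction means. The class names, the `save_flags` field (standing in for the encoded Saveflags bit) and the choice to model only ZF, SF and CF are assumptions for illustration; the prototype's binary encoding is not described here, and the prototype compiles IL to native host code rather than interpreting it.

```python
from dataclasses import dataclass
from typing import Dict, Tuple, Union

MASK32 = 0xFFFFFFFF  # IL values are 32-bit, matching the x86 target


@dataclass(frozen=True)
class Reg:
    name: str


@dataclass(frozen=True)
class Const:
    value: int


@dataclass(frozen=True)
class Shifted:
    reg: Reg
    shift: int  # register shifted left by a constant, e.g. ebx << 2


@dataclass(frozen=True)
class Sum:
    terms: Tuple["Operand", ...]  # register, constant or shifted operands


Operand = Union[Reg, Const, Shifted, Sum]


@dataclass(frozen=True)
class ILInsn:
    opcode: str               # only 'add' and 'sub' in this sketch
    src1: Operand
    src2: Operand
    dst: Reg
    save_flags: bool = False  # hypothetical bit standing in for Saveflags


def value(op: Operand, regs: Dict[str, int]) -> int:
    """Evaluate a simple or complex operand to a 32-bit value."""
    if isinstance(op, Reg):
        return regs[op.name] & MASK32
    if isinstance(op, Const):
        return op.value & MASK32
    if isinstance(op, Shifted):
        return (regs[op.reg.name] << op.shift) & MASK32
    return sum(value(t, regs) for t in op.terms) & MASK32


def execute(insn: ILInsn, regs: Dict[str, int], flags: Dict[str, bool]) -> None:
    """Interpret one IL instruction: dst = src1 <op> src2."""
    a, b = value(insn.src1, regs), value(insn.src2, regs)
    if insn.opcode == "add":
        full = a + b
    elif insn.opcode == "sub":
        full = a - b
    else:
        raise ValueError(f"unsupported opcode {insn.opcode!r}")
    result = full & MASK32
    regs[insn.dst.name] = result
    if insn.save_flags:  # only a few x86 flags are modelled here
        flags["ZF"] = result == 0
        flags["SF"] = bool(result & 0x80000000)
        flags["CF"] = full > MASK32 or full < 0  # carry out of add, borrow in sub
```

With `eax = 1` and `ebx = 3`, the single complex-operand instruction `ILInsn("add", Reg("eax"), Sum((Shifted(Reg("ebx"), 2), Const(8))), Reg("ecx"))` sets `ecx` to 21. Restricting operands to `Reg` and `Const` would split that into several simpler instructions, which is the trade-off described above.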
APPENDIX B
The following table shows comparative speed test results,
obtained by benchmarking a prototype implementation of the
dynamic translation method presented in this paper against
the emulator in the last version of the RAV anti-virus engine.
Within each test, the DT prototype and the emulator analysed
exactly the same instructions and scanned for malware with
the same signature set. For each test, the best time of three
consecutive runs was selected.
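The best-of-three rule can be written as a small timing helper. This is a minimal sketch; the name `best_of` and the use of Python's `time.perf_counter` are assumptions for illustration, not the harness used for the RAV comparison.

```python
import time
from typing import Callable


def best_of(run: Callable[[], object], repeats: int = 3) -> float:
    """Return the shortest wall-clock time, in seconds, over `repeats` runs.

    Interference such as other processes or cold caches tends to make a run
    slower, not faster, so the minimum is the least disturbed measurement.
    """
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        timings.append(time.perf_counter() - start)
    return min(timings)
```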
The files used for the first three tests were infected with
polymorphic file infectors; both the emulator and the DT
prototype achieved a 100% detection rate on these test sets.
The test set for the last test consisted of 100 copies of an
executable file containing a nested decryptor loop: an inner
loop of 1,020 iterations inside an outer loop of 256 iterations.
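For a sense of scale, the last test set can be worked out directly: each file executes 1,020 × 256 = 261,120 inner-loop iterations, so the 100 copies amount to 26,112,000 inner iterations in total. The sketch below reproduces only that loop shape, using a hypothetical byte-wise XOR decryptor; the cipher, the key schedule and the function name are assumptions, since the decryptor in the test files is not described.

```python
def nested_xor_decrypt(data: bytearray, outer: int = 256, inner: int = 1020) -> int:
    """Hypothetical nested decryptor with the loop shape of the last test set.

    The outer loop makes `outer` passes; each pass XORs the first `inner`
    bytes with a pass-dependent key. Returns the inner-loop iteration count.
    """
    if len(data) < inner:
        raise ValueError("buffer shorter than the inner loop length")
    iterations = 0
    for key in range(outer):
        for i in range(inner):
            data[i] ^= key
            iterations += 1
    return iterations
```

A DT engine translates the inner loop body once and then runs it natively for every iteration, which is exactly the repetition this test set exercises.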
The benchmark results indicate that:
• The DT method provides a significant speed
improvement over emulation, in all tests.
• In the case of emulation, speed is, in most cases, not
affected by the complexity of the analysed files. The time
required to emulate a given code sequence depends on the
types of instructions emulated and is proportional to their
number. An emulator takes little or no advantage of
repeating code sequences (a slight improvement is
noticed for the last test set).
• The speed of the DT method relative to native execution
improves with the average number of analysed instructions
per file, as the probability of finding repeating code
sequences increases. Thus, detection of heavily polymorphic
viruses, requiring millions of instructions to be analysed,
can be accomplished in a reasonable time by using
dynamic translation.
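The last two observations come down to translation caching: DT translates each basic block once and reuses the translated code on every later execution, whereas an emulator pays the decoding cost on every pass. The sketch below is a toy cost model under stated assumptions: a guest run is given as a trace of executed block addresses, and the three per-instruction costs are hypothetical units chosen only to show the trend, not measured values.

```python
from typing import Dict, Iterable

# Hypothetical per-instruction costs in arbitrary units (not measured values).
TRANSLATE_COST = 50.0  # one-time cost to translate an instruction
NATIVE_COST = 1.0      # cost to run one translated instruction
EMULATE_COST = 100.0   # cost to decode and emulate one instruction


def dt_cost(trace: Iterable[int], block_sizes: Dict[int, int]) -> float:
    """Modelled DT cost: each basic block is translated once, then reused."""
    translated = set()  # translation cache, keyed by block address
    cost = 0.0
    for addr in trace:
        size = block_sizes[addr]
        if addr not in translated:
            translated.add(addr)
            cost += TRANSLATE_COST * size
        cost += NATIVE_COST * size
    return cost


def emulation_cost(trace: Iterable[int], block_sizes: Dict[int, int]) -> float:
    """Modelled emulator cost: every execution is decoded and emulated again."""
    return sum(EMULATE_COST * block_sizes[addr] for addr in trace)


def slowdown(cost: float, trace: Iterable[int], block_sizes: Dict[int, int]) -> float:
    """Modelled cost relative to running the same instructions natively."""
    native = sum(NATIVE_COST * block_sizes[addr] for addr in trace)
    return cost / native
```

With two blocks of 4 and 6 instructions, a trace that runs each block once gives a modelled DT slowdown of 51, while a trace that repeats the second block 1,000 times brings it down to about 1.08; the emulator stays at 100 in both cases.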
* The average slowdown rate is defined as the time needed by emulation or DT to analyse a given test code, divided by the time
needed by a real CPU to execute the same code natively. An average of 1.5 clock cycles per instruction was used to estimate the
total time required for native execution of the files in the test sets, since actual execution time would be difficult to
measure given that infected files were used.
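Written out, the definition above reads as follows, where T_method is the measured analysis time (emulation or DT), N is the number of instructions analysed and f is the host CPU clock frequency. The frequency used for the benchmarks is not stated here, so f is left symbolic.

```latex
S = \frac{T_{\text{method}}}{T_{\text{native}}},
\qquad
T_{\text{native}} \approx \frac{1.5\,N}{f}
\quad\Longrightarrow\quad
S \approx \frac{f\,T_{\text{method}}}{1.5\,N}
```

As a purely hypothetical illustration, at f = 2 GHz, N = 10^6 instructions give T_native ≈ 1.5 × 10^6 / (2 × 10^9) s = 0.75 ms, so an analysis taking 75 ms would correspond to a slowdown of about 100.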