
Tang 2017

The document presents SEEAD, a novel semantic-based approach for automated binary code de-obfuscation that minimizes human involvement and increases code coverage. By utilizing dynamic taint and control dependence analysis, SEEAD effectively uncovers hidden malicious behaviors in obfuscated binaries, achieving significant reductions in obfuscation instructions. Experimental results demonstrate its effectiveness across various benign and malicious obfuscated programs, highlighting its potential as a low-cost solution for malware analysis.


2017 IEEE Trustcom/BigDataSE/ICESS

SEEAD: A Semantic-based Approach for Automatic Binary Code De-obfuscation
Zhanyong Tang† , Kaiyuan Kuang† , Lei Wang† , Chao Xue† , Xiaoqing Gong†∗
Xiaojiang Chen† , Dingyi Fang† , Jie Liu‡ , Zheng Wang§∗
† School of Information Science and Technology, Northwest University, P.R. China.
‡ Tencent Technology (Shenzhen) LI. § School of Computing and Communications, Lancaster University, UK.

*Corresponding authors:
Xiaoqing Gong, Email address: [email protected]
Zheng Wang, Email address: [email protected]

Abstract—Increasingly sophisticated code obfuscation techniques are quickly adopted by malware developers to escape malware detection and to thwart the reverse engineering efforts of security analysts. State-of-the-art de-obfuscation approaches rely on dynamic analysis, but face the challenge of low code coverage, as not all software execution paths and behaviors are exposed in specific profiling runs. As a result, these approaches often fail to discover hidden malicious patterns. This paper introduces SEEAD, a novel and generic semantic-based de-obfuscation system. When building SEEAD, we try to rely on as few assumptions about the structure of the obfuscation tool as possible, so that the system can keep pace with fast-evolving code obfuscation techniques. To increase code coverage, SEEAD dynamically directs the target program to execute different paths across different runs. This dynamic profiling scheme is coupled with taint and control dependence analysis to reduce the search overhead, and with a carefully designed protection scheme that brings the program back to an error-free state should any error happen during dynamic profiling runs. As a result, the increased code coverage enables us to uncover hidden malicious behaviors that are not detected by traditional dynamic analysis based de-obfuscation approaches. We evaluate SEEAD on a range of benign and malicious obfuscated programs. Our experimental results show that SEEAD is able to successfully recover the original logic from obfuscated binaries.

Index Terms—Malware Analysis, De-obfuscation, Multiple Execution Paths Exploration

I. INTRODUCTION

Code obfuscation [1] methods such as control flow flattening, garbage code insertion, instruction deformation, binary code encryption and packing [2], and virtualization obfuscation [3] are now commonplace in malware. These code obfuscation techniques make it more difficult to uncover the true logic of the program, which makes the work of security analysts considerably harder.

Most existing de-obfuscation approaches [4], [5] target only a limited set of specific obfuscation techniques. They work under the assumption that security analysts have prior knowledge of the structure of the obfuscation tools (obfuscators) used by the malware developer. This means that these approaches require heavy human involvement (which often takes a lot of time and effort) and can only be applied to known obfuscation methods.

The work presented by Coogan et al. [6] is among the first attempts to automate malware code de-obfuscation without human involvement. Their approach does not require security analysts to manually analyze and identify the obfuscation techniques used by the malware. As a result, the time spent on malware analysis is reduced greatly. While promising, Coogan's method can only deal with malware that uses virtualization-based obfuscation tools such as VMProtect [7] and Virtualizer [8].

In this work, we aim to extend the reach of existing malware de-obfuscation techniques. We present SEEAD, a novel and generic automated code de-obfuscation system. SEEAD is a semantic-based de-obfuscation approach. It makes few assumptions about the structure of obfuscators. Therefore, SEEAD can be applied to existing and unknown obfuscation methods. SEEAD works by first identifying the semantically relevant instructions with dynamic taint analysis and control dependency analysis, and then simplifying the instruction traces of the target binary with these analytical results. Because the whole de-obfuscation process of SEEAD does not require any human involvement, it significantly reduces the time spent on malware analysis.

Similar to most de-obfuscation approaches [9], [10], SEEAD also uses dynamic analysis to characterize the program behavior. However, profiling-based dynamic analysis suffers from poor code coverage, because the program execution path during profiling runs only represents the application behavior for a given set of inputs. As a result, existing dynamic analysis based de-obfuscation techniques can miss some of the malware behaviors that are only triggered in specific cases (e.g., when a particular file is present, or when a certain command is received). Our approach to the problem is to dynamically adjust the program control logic to direct the program to execute different paths during different profiling runs to increase the code coverage. Our carefully designed recovery scheme ensures that the program can roll back to an error-free state if the logic change leads to invalid program execution or corrupted data. To reduce the search space and profiling overhead, we combine taint and control dependence analysis to change only the execution branches that depend on the program input and ignore those that do not. As a result, our scheme achieves higher code coverage with reasonable overhead compared to state-of-the-art dynamic analysis based approaches. The increase in code coverage allows us to uncover more hidden behaviors of malware.

We have evaluated SEEAD with a range of benign and

2324-9013/17 $31.00 © 2017 IEEE
DOI 10.1109/Trustcom/BigDataSE/ICESS.2017.246
[Figure 1: block diagram. Recoverable labels: Obfuscated Binary File (input); Online Dynamic Analysis; Executed information extraction; Dynamic Taint Analysis; Control Dependency Analysis; Multiple Execution Path Exploration; Intermediate Data, with record fields (PC, Index, Instruction, Call list, Tainted list) and (PC, Index, PreBlock, SuccBlock, Postdom Sets); Optimization Process; CFG and FCG Construction; Optimized instr. traces and CFG/FCG (output).]

Figure 1: Overview of SEEAD. The figure shows the framework of SEEAD: the components in yellow are the key functions of SEEAD, and the components in blue are its input and output. The others are details produced during the de-obfuscation process.

malicious obfuscated binary programs. Experimental results show that SEEAD is able to eliminate on average 76.8% of the obfuscation instructions, and 87.4% of the obfuscation instructions generated using virtualization-based obfuscation techniques.

The main contributions of this work are:
• We present SEEAD, a novel and generic semantic-based automated code de-obfuscation system that can be applied to unknown obfuscation methods without any human involvement;
• SEEAD is a low-cost solution but provides wider code coverage compared to state-of-the-art dynamic analysis based code de-obfuscation tools;
• Our evaluation, performed on a range of benign and malicious obfuscated binaries, shows that SEEAD is effective at removing obfuscation instructions.

II. SEEAD OVERVIEW

SEEAD is a generic, semantic-based de-obfuscation system. An overview of SEEAD is presented in Figure 1. The input of SEEAD is the obfuscated binary file; the output is the simplified instruction traces, the CFG (Control Flow Graph) and the FCG (Function Call Graph), which can be easily analyzed and understood.

To perform the code de-obfuscation, SEEAD goes through the following steps:
• Extract the execution information of the obfuscated file using a dynamic binary instrumentation tool.
• Identify semantically relevant instructions with dynamic taint analysis and control dependence analysis (Section III).
• Explore multiple execution paths with a low-cost scheme to increase the code coverage (Section IV).
• Perform inter-block and intra-block optimization (Section V), and construct the CFG and FCG for the optimized instruction traces.

In order to identify the semantically relevant instructions, we need to extract the execution information (i.e., assembly instruction traces and the values of registers and memory) of the obfuscated binary file. It is difficult for common debuggers (e.g., OllyDbg [11], IDA Pro [12]) to handle packed and obfuscated malware, because malware developers usually use various anti-reverse engineering strategies to increase the difficulty of malware analysis. In addition, the cost of such analysis is high. Dynamic binary instrumentation tools are effective against these anti-reverse engineering strategies. Thus, we build SEEAD on top of a dynamic binary instrumentation tool called PIN.

III. SEMANTICALLY RELEVANT INSTRUCTION IDENTIFICATION

In this paper, we use dynamic taint analysis to identify values obtained through input operations and the instructions influenced, directly or indirectly, by these input-tainted values. This computation captures only the explicit information flow from the inputs to the outputs of the program; it does not consider implicit information flow [13], so some behaviors may be missed and the semantics of the program may be changed. To this end, we combine control dependency analysis with the explicit data dependencies identified earlier to capture implicit as well as explicit information flow from inputs to outputs.

A. Dynamic Taint Analysis

Dynamic taint analysis [14] is widely applied to program security analysis. The basic idea of dynamic taint analysis is to mark the users' sensitive data or untrusted input data as the taint source and to track the propagation path of the taint source during execution.

Similar to prior work [15], SEEAD uses a one-bit tag (0 for "untainted" data and 1 for "tainted" data) for each value in memory or in the general registers during taint propagation. If necessary, the one-bit tag can easily be extended to a multiple-bit tag for each value.

At the beginning of the taint propagation process, all tags are set to 0. Based on the taint propagation policy, taint sources (e.g., data read from the network or standard input) are tagged with 1 as "tainted". As the program executes, the dynamic taint scheduler propagates the tag information from one instruction to another. It does this by dynamically tracking instructions with information flow, so other data may be tagged with 1 via information flow. Conversely, "tainted" data can become "untainted" if its value is reassigned from untainted (safe) data.
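The one-bit tag scheme described in this subsection can be illustrated with a minimal Python sketch. It is not SEEAD's implementation, which operates on PIN instruction traces; the trace format (a destination plus a list of sources) and the names propagate_taint, trace and "input" are assumptions made only for this example.

```python
# Illustrative sketch only: the (destination, [sources]) trace format and all
# names below are hypothetical and are not taken from SEEAD.

def propagate_taint(trace, taint_sources):
    """Propagate one-bit taint tags along a straight-line instruction trace."""
    tags = {loc: 1 for loc in taint_sources}   # every other location defaults to 0
    tainted_instrs = []
    for index, (dst, srcs) in enumerate(trace):
        tag = 0
        for src in srcs:
            tag |= tags.get(src, 0)            # tainted if any source is tainted
        tags[dst] = tag                        # writing clean data clears the tag
        if tag:
            tainted_instrs.append(index)
    tainted_locs = {loc for loc, t in tags.items() if t}
    return tainted_locs, tainted_instrs


trace = [
    ("eax", ["input"]),        # eax <- input        (eax becomes tainted)
    ("ebx", ["eax"]),          # ebx <- eax          (ebx becomes tainted)
    ("ecx", ["const"]),        # ecx <- constant     (ecx stays clean)
    ("eax", ["const"]),        # eax <- constant     (eax becomes clean again)
    ("edx", ["ebx", "ecx"]),   # edx <- f(ebx, ecx)  (edx becomes tainted)
]
locs, instrs = propagate_taint(trace, {"input"})
print(sorted(locs), instrs)
```

The fourth entry shows the last rule of the text: eax was tainted, but it is reassigned from clean data and loses its tag.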

B. Control Dependency Analysis

Control dependency analysis is mainly used to capture the implicit information flow during program execution. For two instructions I and J of a program, J is said to be control-dependent on I if the outcome of I determines whether J is executed. More formally, J is control-dependent on I only if there is a non-empty path from I to J such that J post-dominates each instruction on this path except I [16]. The computation of control dependencies has been well studied in the compiler literature [16].

Algorithm 1 Computing Control Dependencies
Input: An initial tainted instruction trace T
Output: The instruction trace T with control dependencies between instructions identified
1: Construct an initial control flow graph G of trace T;
2: Compute the post-dominator relationships of G;
3: Use post-dominator relationships to compute explicit control dependencies:
4:   (a) TaintC = the set of input-tainted conditional control transfers; and
5:   (b) DepIns = {x | ∃ C ∈ TaintC : x control dependent on C};
6: while ∃ an indirect control transfer Ins dependent on some x ∈ DepIns do
7:   TaintBB ← basic block of Ins in G;
8:   Mark TaintBB as dependent on the direct control transfer in C that x is dependent on;
9: end while

We consider two types of control flow: explicit and implicit. Explicit control flows are those conditional control transfers where the predicate is explicitly reflected in the instruction of the control transfer. Here, we can use post-dominators to compute explicit control dependencies directly. Implicit control flows are those indirect control transfers, such as "jmp [eax]", where the register eax is data-dependent on the taint sources directly or indirectly. The algorithm we use to compute control dependencies is shown in Algorithm 1.

[Figure 2: assembly listing with columns Address and Assembly Instruction; the addresses and some operands were lost in extraction.]
L1: mov ebx, eax
L2: cmp ecx, eax
L3: jnz …
L4: mov ecx, …
L5: jmp …
L6: mov ecx, …
L7: jmp [edx…ecx…]

Figure 2: An example of control dependencies. Instructions L4 and L6 are explicitly control-dependent on L3. Instruction L7 is implicitly control-dependent on L3.

Figure 2 is an example of the two types of control flow: explicit and implicit. Which of L4 and L6 executes depends on which path is taken at the conditional jump L3; therefore, instructions L4 and L6 are explicitly control-dependent on L3. Moreover, the value of register ecx is data-dependent on the outcome of the conditional jump L3, so the target of the indirect control transfer L7 also depends on which path is taken at L3, which is an implicit control dependency.

IV. MULTIPLE EXECUTION PATHS EXPLORATION

Traditional dynamic analysis techniques usually consider only a single execution path, which typically represents partial program behavior. This practice leads to low code coverage, as some hidden malicious behavior may be missed. To address this problem, previous works usually explore multiple execution paths based on some profiling information (e.g., source code, testing information, etc.). However, in the case of malware, we usually do not have access to this profiling information. Moreover, even when the profiling information is available, existing techniques incur high overhead.

To this end, SEEAD presents a low-cost solution for multiple execution paths exploration. The basic idea is to extend the analysis tool with the capability to explore multiple execution paths and to force the binary to execute them with no profiling information. This is a great challenge, as the search space of all possible paths is usually very large for real-world binaries. However, it is not necessary to consider all the predicates, because branch outcomes are usually not affected by any input. Therefore, we use the results of dynamic taint analysis and control dependency analysis to reduce the search space of the predicates: SEEAD only needs to consider the tainted branch blocks. This greatly reduces the overhead of multiple execution paths exploration.

A. Path Exploration

When a conditional branch occurs during the execution of the program and the current basic block is marked as tainted, we store a snapshot of the current process address space, and the program then continues executing normally. When the process later wishes to terminate, the current process address space is automatically replaced with the saved snapshot, and we modify the outcome of the branch decision so that the process continues its execution along the other branch. Since a program contains many branches, the execution space is explored by selecting the next snapshot in depth-first order. This technique enables us to automatically extract a more complete view of the program.

Here, we will show how to explore multiple execution paths of a program in Figure 3. Assume that the block sequence of

the first execution with an arbitrary input is B1, B2, B4; the blocks saved in the snapshot list are then B2, B1. When the current process wishes to terminate, we first replace the current process address space with the saved snapshot B2, and then with B1.

[Figure 3: a C fragment (int x; x = read_input(); two tests of the form if (x > …); printf("Pass"); exit()) shown beside its assembly code (mov [local], eax; cmp [local], …; jle short …) and divided into basic blocks labelled B. Constants, addresses and block numbers were lost in extraction.]

Figure 3: An example of multiple execution paths exploration

B. Exception Recovery Mechanism

Normally, a process runs until it exits or an exception happens. However, SEEAD does not allow the process to terminate: once it does, the operating system removes the process-related entries and frees its memory, and we are no longer able to restore the current image to a saved snapshot. Moreover, the program input is chosen merely to let the program execute along the normal execution flow, not along the different execution paths that we force, so the input may be incorrect for a forced path and cause exceptions. Thus, we adopt an exception recovery mechanism to handle such exceptions.

In SEEAD, the obfuscated binary program is first executed as usual with an arbitrary input. The recovery mechanism prevents the program from terminating. For programs that exit normally, we hook the system API function NtTerminateProcess() of the ntdll.dll library to monitor whether the process wishes to terminate. Similarly, for programs that crash, we hook the system API function KiUserExceptionDispatcher() of the ntdll.dll library; whenever the process invokes this API, we know that a program exception has occurred. In either case, if there are unexplored paths left, we revert the program's current image to a previously saved state.

V. OPTIMIZATION PROCESS

The optimization process is divided into two parts: inter-block and intra-block optimization. For the inter-block optimization, we discard the blocks that carry no taint mark, as they are semantically irrelevant. For the intra-block optimization, we make as few assumptions as possible about the structure of obfuscators. Thus, we present a set of general but simple semantics-preserving transformations, as follows:
• Stack optimization: There are two cases: a useless push-pop pair, and an element A that is pushed onto the stack and then popped into an element B.
• Dead code removal: Dead code consists of instructions whose execution does not modify the program's final state or control flow. This includes every instruction of a block in which all taints get overwritten before being used.
• Invalid instruction combination: An invalid instruction combination is a group of instructions that together have no functional effect or that cancel each other out (e.g., add eax, 0xF; sub eax, 0xF).

In order to better analyze and understand the logic of the original program, we construct the CFG and FCG for the optimized results.

CFG and FCG Construction: Construction of the CFG and FCG is a basic and highly challenging task for obfuscated binaries, especially the identification of indirect jump targets and API identification. Since SEEAD is based on dynamic analysis, the targets of indirect jump instructions are precise addresses obtained at run time. However, for API identification, there is no standard approach in the literature.

System calls play a vital role in malware detection. To some extent, the API function sequence is a special representation of malware behavior. Thus, in order to prevent security analysts from extracting the API call sequence and analyzing the program behavior, malware developers usually use various API protection techniques to obfuscate system calls. To construct the FCG, we have to develop API identification techniques against API obfuscation to reveal the information of API calls (e.g., the addresses of API calls and their details).

Common API obfuscation techniques can be roughly classified into import table encryption, API hooking and API rewriting. The first two techniques are ineffective against dynamic analysis, because the entry point of the API function can always be traced during dynamic execution. However, it is challenging to reveal the API sequence of a program obfuscated by API rewriting. API rewriting usually copies the first few instructions of the API function to user space for execution, so we cannot easily identify the entry point of the API function during execution. In this paper, we combine code injection and API hooking to monitor the API calls and record their invocation information. Finally, we use the collected information about API calls to construct the FCG. Since these two are standard techniques, we omit their details.

VI. EFFECTIVENESS EVALUATION

A. Effectiveness Analysis

In this subsection, we demonstrate the effectiveness of our de-obfuscation approach by showing that analysts working by hand have only a negligible probability of obtaining the same results as our de-obfuscation approach.

Let $t_{ins}$ denote the average time of analyzing an instruction, and let $N_{obf}$ and $N_{simp}$ denote the number of instructions in the obfuscated program and in the simplified traces, respectively. $P_{time}$ measures how much time we are able to save when analyzing an obfuscated program. It is defined as:

$$P_{time} = 1 - \frac{N_{simp} \times t_{ins}}{N_{obf} \times t_{ins}} = \frac{N_{obf} - N_{simp}}{N_{obf}} \qquad (1)$$

For the $N_{obf}$ instructions of the obfuscated program, if we want to simplify them into $N_{simp}$ instructions, this yields a total of $\frac{N_{obf}!}{(N_{obf} - N_{simp})!}$ combinations. The security analysts' probability

of correctly getting these $N_{simp}$ instructions is $\frac{(N_{obf} - N_{simp})!}{N_{obf}!}$.

For a 1536 B obfuscated program ($N_{obf} = 751$ instructions), the instructions are simplified to 31 instructions by our de-obfuscation approach. The security analysts' probability of getting the same results as our de-obfuscation approach is therefore:

$$P[\text{Analysis}] = \frac{(N_{obf} - N_{simp})!}{N_{obf}!} = \frac{(751 - 31)!}{751!} = 9.68^{-87}$$

The time we are able to save when analyzing this obfuscated program is:

$$P_{time} = \frac{N_{obf} - N_{simp}}{N_{obf}} = \frac{751 - 31}{751} = 95.872\%$$
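Both expressions of this subsection can be evaluated directly as a cross-check. The Python sketch below uses exact integer arithmetic for the factorial ratio; the function names time_saved and guess_probability are illustrative and do not come from the paper. Evaluated this way, the probability comes out on the order of 10^-89, so the figure printed in the text is best read as an order-of-magnitude indication.

```python
import math


def time_saved(n_obf, n_simp):
    """P_time = (N_obf - N_simp) / N_obf, as in Eq. (1)."""
    return (n_obf - n_simp) / n_obf


def guess_probability(n_obf, n_simp):
    """(N_obf - N_simp)! / N_obf!, i.e. one over the number of ordered selections."""
    # math.perm(n, k) == n! / (n - k)!, computed exactly on integers (Python 3.8+).
    return 1 / math.perm(n_obf, n_simp)


p_time = time_saved(751, 31)
p_guess = guess_probability(751, 31)
print(f"P_time = {p_time:.3%}")
print(f"P[Analysis] = {p_guess:.2e}")
```

Using math.perm avoids forming the two full factorials and keeps the division exact until the final conversion to a float.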
B. Experimental Results

We have implemented a prototype of SEEAD on top of PIN. It supports Win32 executables. In this section, we present the results of evaluating SEEAD with six samples, each obfuscated by four obfuscation tools, and demonstrate the effectiveness of our approach to multiple execution paths exploration.

Existing virtualization de-obfuscation techniques first reverse engineer the structure of the virtual interpreter, then recover all the bytecode instructions based on this information, and finally recover the logic embedded in the interpreter. This approach is very effective when the interpreter structure meets certain requirements. However, without the assumption of a known interpreter structure, it may not work well.

VMProtect and Code Virtualizer are two representative obfuscation tools that have been considered in previous work [5], [6]. However, these works usually do not discuss non-virtualization obfuscations (e.g., control flow flattening, instruction deformation, encryption, etc.), so we do not know whether they are also able to handle programs obfuscated by such techniques. As far as we know, none of the existing de-obfuscation approaches can be applied to most obfuscation techniques. Thus, we present SEEAD, which is effective for most obfuscation techniques. In this subsection, we demonstrate the power of SEEAD with four common obfuscation tools: CF Obfuscator, MEMP [17], Code Virtualizer (CV) [18] and VMProtect (VMP) [19]. CF Obfuscator is a binary control flow flattening tool that implements the control flow algorithm OBFWHKD [20]. MEMP combines equivalent deformation, control flow obfuscation, and dynamic encryption and decryption. We present the results of evaluating SEEAD with six programs, all of which are commonly used obfuscated samples [18]. Because that work does not discuss non-virtualization obfuscations, we also obfuscate these six programs with CF Obfuscator and MEMP.

Let $N_{orig}$, $N_{obf}$ and $N_{simp}$ denote the number of instructions in the original program, the obfuscated program and the simplified traces, respectively. The simplification score measures how much obfuscation code we have been able to eliminate. It is defined as:

$$\text{Simplification Score} = \frac{N_{obf} - N_{simp}}{N_{obf}} \qquad (2)$$

[Figure 4: plot of Simplification Score (y-axis) over the obfuscated samples bin_search, bubble_sort, huffman, matrix_mult, fibonacci and factorial (x-axis), with one series per obfuscator: CF Obfuscator, MEMP, VMP and CV.]

Figure 4: Comparison results of simplification scores

[Figure 5: plot of Difference Score (y-axis) over the same six obfuscated samples (x-axis), with one series per obfuscator: CF Obfuscator, MEMP, VMP and CV.]

Figure 5: Comparison results of difference scores

The difference score measures the difference in the number of instructions between the original program and the simplified traces. It is defined as:

$$\text{Difference Score} = \frac{|N_{orig} - N_{simp}|}{N_{orig}} \qquad (3)$$

Analysis results for programs obfuscated with these four obfuscation tools are given in Tables I to IV. The first column shows the name of the sample. The next three columns report the number of instructions in the original program, the obfuscated program and the simplified traces. The next two columns show the number of total blocks and input-taint blocks, respectively. Finally, the last two columns present the simplification score and the difference score.

Figure 4 shows the comparison of simplification scores across all samples. The simplification score for MEMP is about 0.65 on average, which means that SEEAD is able to eliminate about 65% of the obfuscation instructions introduced by MEMP. The simplification score for CV is over 0.94 on average. The simplification scores

Table I: Results for programs obfuscated with CF Obfuscator

Samples      Original    Obfuscated  Simplified  Total basic  Input-taint   Simplification  Difference
             trace size  trace size  trace size  blocks       basic blocks  Score           Score
bin search   166         221         108         21           19            0.511312        0.3494
bubble sort  316         641         263         22           6             0.589704        0.16772
huffman      4367        7226        833         59           31            0.884722        0.80925
matrix-mult  651         936         479         44           28            0.488248        0.26421
fibonacci    2930        2950        781         20           12            0.735254        0.73345
factorial    132         174         39          13           10            0.775862        0.70455

Table II: Results for programs obfuscated with MEMP

Samples      Original    Obfuscated  Simplified  Total basic  Input-taint   Simplification  Difference
             trace size  trace size  trace size  blocks       basic blocks  Score           Score
bin search   166         1325        549         100          70            0.58566         2.307229
bubble sort  316         4529        2054        123          96            0.546478        5.5
huffman      4367        34410       10532       170          125           0.693926        1.411724
matrix-mult  651         8230        3758        317          271           0.543378        4.772657
fibonacci    2930        2999        800         29           17            0.733244        0.72696
factorial    132         255         53          26           17            0.792157        0.59848

Table III: Results for programs obfuscated with VMProtect

Samples      Original    Obfuscated  Simplified  Total basic  Input-taint   Simplification  Difference
             trace size  trace size  trace size  blocks       basic blocks  Score           Score
bin search   166         859226      215148      314          220           0.749603        1295.072
bubble sort  316         2371635     501189      215          143           0.788674        1585.401
huffman      4367        5682255     1949412     355          328           0.65693         445.3962
matrix-mult  651         2762309     520705      327          251           0.811496        798.8541
fibonacci    2930        26549       2391        28           26            0.90994         0.18396
factorial    132         12611       1037        25           24            0.91777         6.856061

Table IV: Results for programs obfuscated with Code Virtualizer

Samples      Original    Obfuscated  Simplified  Total basic  Input-taint   Simplification  Difference
             trace size  trace size  trace size  blocks       basic blocks  Score           Score
bin search   166         163599      2819        356          52            0.982769        15.98193
bubble sort  316         605079      5872        322          25            0.990295        17.58228
huffman      4367        2553216     158286      369          114           0.938005        35.24594
matrix-mult  651         693674      50283       315          72            0.927512        76.23963
fibonacci    2930        24818       1647        283          62            0.933637        0.43788
factorial    132         12858       1596        271          96            0.875875        11.09091
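The last two columns of Tables I to IV follow from the three trace sizes in each row through Eqs. (2) and (3). The short Python sketch below recomputes them for the bin search rows of Tables I and III; the function names and the rows dictionary are illustrative only.

```python
def simplification_score(n_obf, n_simp):
    """Eq. (2): fraction of the obfuscated trace that was eliminated."""
    return (n_obf - n_simp) / n_obf


def difference_score(n_orig, n_simp):
    """Eq. (3): relative size gap between simplified and original traces."""
    return abs(n_orig - n_simp) / n_orig


# (original, obfuscated, simplified) trace sizes of bin search,
# copied from Table I (CF Obfuscator) and Table III (VMProtect).
rows = {
    "CF Obfuscator": (166, 221, 108),
    "VMProtect": (166, 859226, 215148),
}
for tool, (n_orig, n_obf, n_simp) in rows.items():
    print(tool, simplification_score(n_obf, n_simp), difference_score(n_orig, n_simp))
```

A difference score far above 1, as in the VMProtect row, means the simplified trace is still much longer than the original program even though most obfuscation instructions were removed.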

introduced by CF Obfuscator and VMP lie in the middle: they are about 0.67 and 0.81 on average, respectively.

Similarly, the comparison of difference scores is shown in Figure 5. The highest difference score, about 689 on average, is introduced by VMP, and the lowest, just over 0.5 on average, is introduced by CF Obfuscator. The difference scores introduced by MEMP and CV lie in the middle: they are about 2.6 and 26 on average, respectively.

Overall, these results are encouraging, especially for virtualization obfuscations, as SEEAD identifies only those instructions that are semantically relevant to the original code and discards those that are semantically irrelevant. Our evaluation results show that we can straightforwardly reconstruct the logic of the original program and analyze it correctly from the functionality we have traced.

We observed that most of the "missed" instructions fall into two categories: on the one hand, instructions that perform preparatory work such as allocating memory or initializing data structures; on the other hand, instructions that perform invalid actions that are semantically irrelevant, such as garbage instructions and invalid instruction combinations. All of these instructions were identified by our analysis.

We examined the results by hand, and found that the reason for the high simplification scores is that all the test cases we used are toy programs. We believe that this paper is just an initial step towards developing an advanced and functionally powerful de-obfuscation tool.

The results in Table I, Table II, Table III and Table IV show an extraordinary increase in the number of executed instructions for all four obfuscation tools. For example, bin search executes 166 instructions in the original program, whereas the programs obfuscated by CF Obfuscator, MEMP, CV and VMP execute 221, 1325, 163599 and 859226 instructions, respectively.

Traditional dynamic analysis typically represents only partial program behavior, and its coverage relies heavily on good inputs, which may not be available. We compare the results of SEEAD with traditional dynamic analysis to demonstrate the effectiveness of our approach to multiple execution paths exploration. The comparison results are presented in Table V. Columns 2-5 present the comparison results of the

Table V: Evaluation of multiple execution paths exploration

             CF Obfuscator                                MEMP
Samples      Branch  Input-taint    Dynamic   SEEAD   Branch  Input-taint    Dynamic   SEEAD
             blocks  branch blocks  analysis          blocks  branch blocks  analysis
bin search   5       5              203       221     5       5              1301      1325
bubble sort  3       2              641       641     3       2              4529      4529
huffman      15      9              4383      7226    12      9              27325     34410
matrix mult  15      12             840       936     9       7              7119      8230
fibonacci    4       2              2943      2950    4       2              2992      2999
factorial    2       2              141       174     2       1              189       255
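The exploration strategy evaluated in Table V (Section IV-A: snapshot at each tainted branch, roll back when the process wishes to terminate, then force the other outcome in depth-first order) can be sketched on a toy control flow graph. The Python code below is a simplified model under stated assumptions, not SEEAD's implementation: the graph, the block names and the explore function are hypothetical, a snapshot is reduced to the path prefix, and the graph is assumed to be acyclic.

```python
# Hypothetical toy CFG: block -> successors; the first successor is the one
# taken by the arbitrary input of the first run. Assumed to be acyclic.
CFG = {
    "B1": ["B2", "B4"],   # input-tainted conditional branch
    "B2": ["B4", "B3"],   # input-tainted conditional branch
    "B3": [],             # terminating block
    "B4": [],             # terminating block
}
TAINTED_BRANCHES = {"B1", "B2"}


def explore(cfg, tainted, entry):
    """Return every block sequence reached by flipping tainted branches."""
    paths = []
    snapshots = []                       # stack of (path prefix, untaken successor)
    path, block = [], entry
    while True:
        path.append(block)
        succs = cfg[block]
        if succs:
            if block in tainted:
                # Save a snapshot so the other outcome can be forced later.
                for other in succs[1:]:
                    snapshots.append((list(path), other))
            block = succs[0]
            continue
        # The process "wishes to terminate": record the path, then roll back.
        paths.append(path)
        if not snapshots:
            return paths
        path, block = snapshots.pop()    # most recent snapshot first (depth-first)


for p in explore(CFG, TAINTED_BRANCHES, "B1"):
    print(" -> ".join(p))
```

With this graph the first run is B1, B2, B4, and the snapshots are then restored at B2 first and at B1 second, which is the order given in the worked example of Section IV-A. Untainted branches are never flipped, which is what keeps the search space small.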

 
program obfuscated by CF Obfuscator. Column 2 presents the number of branch blocks. Column 3 shows the number of branch blocks influenced by input-tainted values. We explore multiple execution paths according to these input-taint branch blocks. Columns 4-5 present the instructions covered by the different approaches: column 4 shows the number of instructions executed by traditional dynamic analysis, and column 5 shows those extracted by SEEAD. Similarly, columns 6-9 present the comparison results of the program obfuscated by MEMP.

[Figure 6 plot omitted. x-axis: Packed Malware; y-axis: Number of API Calls; legend: Packed Program, Dynamic Analysis, SEEAD]

Figure 6: Comparison results of API calls

Figure 6 shows the comparison results between traditional dynamic analysis and SEEAD. From the coverage data, we observed that SEEAD could cover more instructions than dynamic analysis; however, for our test cases, the increase in the number of instructions was small. We examined the results by hand and found two reasons. First, we provided good inputs for the test cases in the dynamic analysis, so it could cover most instructions. Second, the increase in the number of instructions was closely related to the function of the obfuscated code.

In general, the experimental results show that SEEAD can handle most obfuscations without any human involvement while increasing code coverage, which improves the effectiveness and efficiency of malware analysis.

VII. DISCUSSION AND FUTURE WORK

Existing binary analysis techniques can be roughly classified into static [21], dynamic [22], and symbolic analysis [9], [10]. However, all three techniques have their limitations. We now compare these techniques in terms of code coverage, the capability of handling packing and obfuscation, and scalability. Static analysis usually has good code coverage and is very scalable. However, it is difficult for static analysis to handle packed and obfuscated programs, because some instructions of the target binary are computed dynamically. Dynamic analysis usually produces only partial program behavior, and its code coverage often relies heavily on good inputs, which may not be available. Symbolic analysis is able to construct inputs from the path conditions, but has difficulty in handling packed or obfuscated binaries.

It is difficult for SEEAD to model multiple threads into a single execution since their execution sequence is non-deterministic. X-Force [23] adopts a simple and effective approach to serialize the execution of threads: calls to thread creation library functions are replaced with direct function calls to the starting functions of the threads, which avoids creating multiple threads and guarantees code coverage at the same time. However, this approach is ineffective for analyzing behavior that is sensitive to schedules. In the future, we will explore handling real concurrent executions.

VIII. RELATED WORK

a) De-obfuscation mechanisms: De-obfuscation is not a new problem, and a number of solutions already exist. Udupa et al. [24] discuss deobfuscating code that has been obfuscated by control flow flattening [25], which resembles emulation-based obfuscation in some ways. Jones et al. [26] describe a technique for specializing away interpretive code. These works are based on static analysis, which is ineffective against complex obfuscated binaries, e.g., due to dynamic encryption and decryption and self-modifying code.

Sharif et al. [3] present an approach for de-obfuscation that first reverse engineers the VM emulator and then uses the information to work out individual byte code instructions. However, the proposed approach may not work well when the emulator uses techniques that do not fit its assumptions. Coogan et al. [6] propose a semantics-based approach for de-obfuscation that uses equational reasoning about assembly instruction semantics to simplify the obfuscation code from execution traces of virtualization-obfuscated programs. However, it does not seem straightforward to control the whole de-obfuscation process to recover the logic of the program. Moreover, that work does not construct the CFG and FCG for better understanding of the de-obfuscated results.
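The general idea behind trace-based de-obfuscation, keeping the instructions that contribute to observable behavior and dropping the rest, can be illustrated with a toy backward dependence pass over an execution trace. This is a simplified sketch under our own assumptions: it is neither SEEAD's algorithm nor the equational reasoning of [6], and the trace format, the `relevant_instructions` helper and the sample trace are hypothetical.

```python
from typing import List, Set, Tuple

# Hypothetical trace entry: (instruction text, locations written, locations read).
Instr = Tuple[str, Set[str], Set[str]]


def relevant_instructions(trace: List[Instr], outputs: Set[str]) -> List[str]:
    """Keep only instructions whose results flow into the given outputs.

    Walks the trace backwards while maintaining the set of locations whose
    values are still needed (a simple backward data-dependence slice).
    """
    live = set(outputs)
    kept: List[str] = []
    for text, writes, reads in reversed(trace):
        if writes & live:
            # This instruction defines a needed value: its inputs become needed.
            live -= writes
            live |= reads
            kept.append(text)
    kept.reverse()
    return kept


# Toy trace: the two writes to ebx never reach the output and act as garbage.
TRACE: List[Instr] = [
    ("mov eax, [input]", {"eax"}, {"input"}),
    ("mov ebx, 0x1234", {"ebx"}, set()),
    ("xor ebx, ebx", {"ebx"}, {"ebx"}),
    ("add eax, 1", {"eax"}, {"eax"}),
    ("mov [out], eax", {"out"}, {"eax"}),
]

if __name__ == "__main__":
    for line in relevant_instructions(TRACE, {"out"}):
        print(line)
```

A real system would additionally have to track memory at byte granularity, flags and control dependencies; the sketch only conveys the data-dependence part.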

b) Multiple Execution Paths Exploration: Early approaches to multiple execution paths exploration usually rely on profiling information, such as source code, software testing and debugging information, to construct concrete program inputs [27], [28]. Unfortunately, in practice, such information is generally not available. Hence, for malware, the assumption can be considered unrealistic. The work in [29] first requires concrete inputs and then mutates them to explore different paths, which incurs high overhead.

The approach in [23] explores multiple execution paths by forcing branch outcomes to be reversed in order to construct control flow graphs. However, some of the paths it explores are infeasible. Similar techniques have been proposed to expose hidden behavior in Android apps [30], [31]. These techniques randomly determine each branch’s outcome and face the challenge of excessive infeasible paths.

IX. CONCLUSIONS

This paper has presented SEEAD, a novel, generic framework for code de-obfuscation, targeting malware detection. SEEAD employs dynamic taint analysis and control dependency analysis to carefully direct the program execution path across profiling runs to increase the code coverage. It then simplifies the instruction traces of the target binary to perform code de-obfuscation. SEEAD is fully automatic and requires little human involvement. We evaluate SEEAD on a range of benign and malicious obfuscated programs. Experimental results show that SEEAD can successfully recover the original logic from obfuscated binaries.

X. ACKNOWLEDGMENT

This work was partially supported by projects of the National Natural Science Foundation of China (No. 61672427, No. 61572402), the International Cooperation Foundation of Shaanxi Province, China (No. 2015KW-003), the Research Project of Shaanxi Province Department of Education (No. 15JK1734), the Service Special Foundation of Shaanxi Province Department of Education (No. 16JF028), and the Research Project of NWU, China (No. 14NW28). This work was especially supported by Tencent.

REFERENCES

[1] C. Collberg, C. Thomborson, and D. Low, “A taxonomy of obfuscating transformations,” Department of Computer Science, The University of Auckland, New Zealand, Tech. Rep., 1997.
[2] R. Langner, “Stuxnet: Dissecting a cyberwarfare weapon,” Security & Privacy, IEEE, vol. 9, no. 3, pp. 49–51, 2011.
[3] M. Sharif, A. Lanzi, J. Giffin, and W. Lee, “Automatic reverse engineering of malware emulators,” in Security and Privacy, 2009 30th IEEE Symposium on. IEEE, 2009, pp. 94–109.
[4] R. Rolles, “Unpacking virtualization obfuscators,” in 3rd USENIX Workshop on Offensive Technologies (WOOT), 2009.
[5] M. G. Kang, P. Poosankam, and H. Yin, “Renovo: A hidden code extractor for packed executables,” in Proceedings of the 2007 ACM workshop on Recurring malcode. ACM, 2007, pp. 46–53.
[6] K. Coogan, G. Lu, and S. Debray, “Deobfuscation of virtualization-obfuscated software: a semantics-based approach,” in Proceedings of the 18th ACM conference on Computer and communications security. ACM, 2011, pp. 275–284.
[7] “Vmprotect - new-generation software protection,” http://www.vmprotect.ru/, Tech. Rep.
[8] “Code virtualizer: Total obfuscation against reverse engineering,” Oreans Technologies, http://www.oreans.com/codevirtualizer.php, Tech. Rep., 2008.
[9] C. Cadar, D. Dunbar, and D. R. Engler, “Klee: Unassisted and automatic generation of high-coverage tests for complex systems programs.” in OSDI, vol. 8, 2008, pp. 209–224.
[10] V. Chipounov, V. Kuznetsov, and G. Candea, S2E: a platform for in-vivo multi-path analysis of software systems. ACM, 2012, vol. 47, no. 4.
[11] O. Yuschuk, “Ollydbg 1.1: A 32-bit assembler level analysing debugger for microsoft windows, june 2004.”
[12] “Ida pro: a windows, linux or mac os x hosted,” https://www.hex-rays.com/products/ida/index.shtml, Tech. Rep.
[13] G. E. Suh, J. W. Lee, D. Zhang, and S. Devadas, “Secure program execution via dynamic information flow tracking,” in Acm Sigplan Notices, vol. 39, no. 11. ACM, 2004, pp. 85–96.
[14] P. Saxena, R. Sekar, and V. Puranik, “Efficient fine-grained binary instrumentation with applications to taint-tracking,” in Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization. ACM, 2008, pp. 74–83.
[15] A. Lakhotia and E. U. Kumar, “Abstracting stack to detect obfuscated calls in binaries,” in Source Code Analysis and Manipulation, 2004. Fourth IEEE International Workshop on. IEEE, 2004, pp. 17–26.
[16] A. V. Aho, R. Sethi, and J. D. Ullman, Compilers, Principles, Techniques. Addison wesley, 1986.
[17] W. H. Fang Dingyi, Li Guanghui, “Research on deformation based binary,” Journal of Sichuan University (Engineering Science Edition), 2014,1:003.
[18] “Obfuscated samples,” http://www.cs.arizona.edu/projects/lynx/Samples/, Tech. Rep.
[19] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood, “Pin: building customized program analysis tools with dynamic instrumentation,” in ACM Sigplan Notices, vol. 40, no. 6. ACM, 2005, pp. 190–200.
[20] J. Nagra and C. Collberg, Surreptitious Software: Obfuscation, Watermarking, and Tamperproofing for Software Protection. Pearson Education, 2009.
[21] M. Christodorescu and S. Jha, “Static analysis of executables to detect malicious patterns,” DTIC Document, Tech. Rep., 2006.
[22] J. Zeng, Y. Fu, K. A. Miller, Z. Lin, X. Zhang, and D. Xu, “Obfuscation resilient binary code reuse through trace-oriented programming,” in Proceedings of the 2013 ACM SIGSAC conference on Computer & communications security. ACM, 2013, pp. 487–498.
[23] F. Peng, Z. Deng, X. Zhang, D. Xu, Z. Lin, and Z. Su, “X-force: Force-executing binary programs for security applications,” in Proceedings of the 2014 USENIX Security Symposium, San Diego, CA (August 2014), 2014.
[24] S. K. Udupa, S. K. Debray, and M. Madou, “Deobfuscation: Reverse engineering obfuscated code,” in Reverse Engineering, 12th Working Conference on. IEEE, 2005, pp. 10–pp.
[25] C. Wang, J. Davidson, J. Hill, and J. Knight, “Protection of software-based survivability mechanisms,” in Dependable Systems and Networks, 2001. DSN 2001. International Conference on. IEEE, 2001, pp. 193–202.
[26] N. D. Jones, C. K. Gomard, and P. Sestoft, Partial evaluation and automatic program generation. Peter Sestoft, 1993.
[27] X. Zhang, N. Gupta, and R. Gupta, “Locating faults through automated predicate switching,” in Proceedings of the 28th international conference on Software engineering. ACM, 2006, pp. 272–281.
[28] S. Lu, P. Zhou, W. Liu, Y. Zhou, and J. Torrellas, “Pathexpander: Architectural support for increasing the path coverage of dynamic bug detection,” in Microarchitecture, 2006. MICRO-39. 39th Annual IEEE/ACM International Symposium on. IEEE, 2006, pp. 38–52.
[29] A. Moser, C. Kruegel, and E. Kirda, “Exploring multiple execution paths for malware analysis,” in Security and Privacy, 2007. SP’07. IEEE Symposium on. IEEE, 2007, pp. 231–245.
[30] R. Johnson and A. Stavrou, “Forced-path execution for android applications on x86 platforms,” in Software Security and Reliability-Companion (SERE-C), 2013 IEEE 7th International Conference on. IEEE, 2013, pp. 188–197.
[31] Z. Wang, R. Johnson, R. Murmuria, and A. Stavrou, “Exposing security risks for commercial mobile devices,” in Computer Network Security. Springer, 2012, pp. 3–21.
