
J Comput Virol (2008) 4:335–345

DOI 10.1007/s11416-008-0087-z

ORIGINAL PAPER

A boosting ensemble for the recognition of code sharing in malware


Stanley J. Barr · Samuel J. Cardman ·
David M. Martin Jr.

Received: 11 November 2007 / Revised: 29 March 2008 / Accepted: 6 April 2008 / Published online: 22 April 2008
© Springer-Verlag France 2008

Abstract Research and development efforts have recently started to compare malware variants, as it is believed that malware authors are reusing code. A number of these projects have focused on identifying functions through the use of signature-based classifiers. We introduce three new classifiers that characterize a function's use of global data. Experiments on malware show that we can meaningfully correlate functions on the basis of their global data references even when their functions share little code. We also present an algorithm that combines existing classifiers and our new ones into an ensemble for correlating functions in two binary programs. For testing, we developed a model for comparing our work to previous signature-based classifiers. We then used that model to show how our new combined ensemble classifier dominates the previously reported classifiers. The resulting ensemble can be used by malware analysts when they are comparing two binaries. This technique will allow them to correlate both functions and global data references between the two and will lead to a quick identification of any sharing that is occurring.

S. J. Barr (B) · S. J. Cardman
Distributed Systems Center, The MITRE Corporation, Bedford, MA 01730, USA
e-mail: [email protected]

S. J. Cardman
e-mail: [email protected]

D. M. Martin Jr.
Computer Science Department, University of Massachusetts Lowell, Lowell, MA 01854, USA
e-mail: [email protected]

1 Introduction

The benefits of reusing a software component are considerable. The cost of the product may decrease because less effort must be expended in development, and the reliability of the product may increase because the reused component has already been tested. Thus, it comes as no surprise to learn that writers of malware reuse old code and incorporate each others' designs and implementations into their programs. In this context, there is an urgent need to be able to compare two malware samples and identify commonality.

Researchers can reverse engineer a piece of malicious software to understand its operation [1]. However, reverse engineering is laborious. In cases such as the Bagle virus, new versions keep appearing [2]. A new variant may appear before the analysis of a previous version is complete. Leveraging the analysis of common behavior preserved between generations would be a great step in understanding and combating new malware variants.

Global identifiers such as variable and function labels allow for the correlation of functions between two binaries. But malware is usually presented as a stripped binary, i.e., function and data labels have been removed and no source code or other metadata is available. This makes it more difficult to analyze the binary or compare it to another binary.

Current techniques for comparing two binaries for similarity rely on correlating functions by generating code-based signatures [3-10]. Some example features for these code-based signatures are instruction sequences, control flow graphs, function call graphs, and DLL (dynamically linked library) function calls. Thus, for example, binaries can be compared when they reference external functions provided by the operating system through the use of DLLs, because symbol names must be kept when external references are used.


A signature can be constructed for a function simply by enumerating the names of externally referenced functions and the number of times each external function is mentioned in the binary code. If two functions in different binaries share such a signature and each signature is unique within its own executable, then this would suggest that the two functions are related.

Now the question naturally arises: can one improve the comparison by including and correlating global data references? Roughly speaking, any absolute address named by a binary's machine code is considered a global data reference if it is read from, written to, or stored on the stack. Thus, no distinction is made between scalars, arrays, or pointers.

To answer the question, we have combined several signature-based classifiers into an ensemble. This ensemble includes existing classifiers and introduces three new classifiers. The first new classifier, GdrMatcher, associates a piece of global data with the functions that refer to it. The second classifier, AllLocalIds, describes a function by listing its own global data references. The third classifier, AllParentIds, describes a function f by listing the global data references of the callers of f. This new ensemble can be used to correlate both functions and global data references between two binaries. We develop a model to compare the ensemble's performance before and after our new classifiers have been integrated and demonstrate that the new ensemble is an improvement over the former.

We also conduct an experiment to explore the concept of correlating global data references in an attempt to answer two basic questions:

1. Can we use these correlated data references to develop more accurate function correlations?
2. Can we correlate global data references among entire binaries known to be related and avoid spurious correlations between unrelated binaries?

We conducted our experiments with malware for the following reasons.

– There are large data sets of hand-classified malware variants that we can compare with our algorithm's output.
– When malware first appears, it must be analyzed in order to understand the full extent of its behavior. However, reliable detection and identification even of bounded-length mutating viruses is NP-complete [11]. Alternate approaches to malware identification are therefore quite appropriate.
– Malware authors modify existing malware to improve functionality and evade detection by antivirus products. These modifications are accomplished both by modifying source code and by directly manipulating existing binaries. While it is suspected that malware authors share code, no one has an accurate assessment of the extent or scope of this practice.
– In our sample set we found that the average malware program had 134 global data references, suggesting that at least this subset of malware is amenable to correlation of global data references.

Combined with the threats posed by malware, these reasons make it a reasonable data set for analysis, comparison, and exploration.

As previously stated, our current data set has on average 134 data references per executable. As malware programs grow in complexity and size, we believe more references will appear in binaries. Provided that each data reference has a unique usage pattern (see signature (10)), we currently believe any number of them could be correlated between two related samples.

It should be noted that although we are analyzing malware, we are not attempting to create an antivirus tool for detecting malware variants. Antivirus tools may be able to detect that two programs are variants even when we cannot identify any shared functions: their goal is to use any distinctive portion of the code or data to construct a signature for reliable detection. We instead seek to correlate information in order to support the reverse engineering process and allow researchers to find commonalities between binaries.

The paper is structured as follows. Section 2 describes previous efforts from which we have drawn components. Section 3 discusses desirable properties for our classifiers. Section 4 describes our algorithm. In Sect. 5, we describe our experiments. Section 6 offers our conclusions and some possible future directions. More details can be found in an extended version of this paper [12].

2 Related work

There are a number of techniques that have been shown to help identify software variants. We have drawn heavily from these works.

– Hashes: Cryptographic hashes have been used to identify software clones in source and in binaries [4], although they are overly sensitive to minor variations in the input.
– Sequence analysis: Sequence analysis algorithms [9,13] have proven successful in the analysis of related binaries. In fact, most antivirus tools are based on sequence matching. These techniques usually do not require disassembling binary code. This is appealing, since malicious coders can employ a number of techniques to thwart disassembly [14]. However, they are susceptible to certain types of obfuscation, such as the substitution of semantically equivalent instruction sequences.

In a recent paper [10], the author demonstrated some interesting results based on Boolean function modeling. The patterns he used to detect malware under this new model allow for efficient detection. Also, the patterns are robust to black-box analysis, making it difficult for malware authors to determine what portions of the executable comprise the pattern being checked.

The author of [15] takes a completely different approach to sequence analysis. This work uses the notion of normalized compression distance, which is derived from Kolmogorov complexity [16]. The idea put forth is that if two strings are similar, then the concatenation and subsequent compression of the strings will result in a length which is not much larger than the length of compressing the larger of the two by itself. Conversely, the same operation on two strings which are not similar will result in an output length significantly larger than the result of compressing either individually. The difference of the resulting lengths is considered a distance between the two strings. The file contents of malware samples are used as input strings, and the resulting distances between the samples are used to cluster them.
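To make the distance concrete, the following minimal sketch computes the normalized compression distance in the standard form found in the Kolmogorov-complexity literature [16], using zlib as a stand-in compressor; the compressor and exact normalization used in [15] may differ.

import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings.

    Values near 0 suggest the inputs share most of their information;
    values near 1 suggest they are unrelated.
    """
    cx = len(zlib.compress(x))
    cy = len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

# Malware samples would be compared by their raw file contents, e.g.:
# d = ncd(open("sample_a.bin", "rb").read(), open("sample_b.bin", "rb").read())
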
– Graph matching: We use the techniques given in [3], which allow for the construction of hashes from control flow graphs. Other research has offered novel techniques for identifying variants of binaries by mapping program control flow features into directed graphs and then using graph-theoretic techniques to determine the similarity of the original code [5,8].
– Function signatures: The authors in [5,6] describe using structural components of functions to create signatures. Some of the structural components which have been used to compare functions in different executables are: the strings referenced, the system calls made, and the assembly language opcodes used.

This research has focused on enhancing these approaches to increase their detection of software variants. When we have the ability to correlate functions between two binaries, it is useful to have a metric that gives a numeric value to the amount of code shared. We have incorporated the similarity metric developed in [6] as our measure.

3 Desirable properties for binary code classifiers

In order to reliably detect variant binaries, it is important to understand how changes can manifest themselves in the binaries. The work done in [3,5] provides more information on variability in compiled code bases. Simply put, traditional compilers follow a deterministic process during code generation. Therefore, variants occur because a different compiler is used, the source code is modified, or modifications are made directly to the binary. Changing optimization levels to trade computation time for program size can significantly change the generated code. Different compilers or compiler versions may generate different binaries for the same code. While in general authors can make an infinite variety of modifications, they often follow the ancient adage "If it ain't broke, don't fix it!" By only altering code whose behavior they seek to change, they may leave enough information intact to identify and correlate functions.

In this research we restrict ourselves to the following types of variants: instances where functions or blocks have been reordered, code where the compiler settings may have introduced subtle changes, and cases where changes have left the structure of the code relatively similar. These restrictions allow for variability in the structure or functionality when comparing functions. They also require that some measurable feature be preserved. This measurable feature becomes the basis for the claim that two pieces of code being compared are variants. To identify these variations, our ensemble will be insensitive to the following categories of code modifications, as referenced in [5]:

– Function reordering
– Alternate register allocation schemes
– Location and offset changes
– Branch inversion and block reordering
– Simple insertion, deletion, transposition, and substitution of instructions

Instead of trying to accomplish the above objectives with a single complex classifier, we use a number of simple classifiers. Each one focuses on identifying functions based on different characteristics. The goal is to construct an ensemble with classification ability better than the sum of each of the classifiers working independently. Works such as [17,18] have laid foundations showing that one can combine multiple weak clustering and classification algorithms to produce a single strong ensemble.

4 Boosting algorithm

4.1 High-level description

Our ensemble takes two binaries as input. Processing starts by extracting features from both binaries. These features will form the basis of our signature construction. Features are provided primarily by parsing the C-like output of the REC Reverse Engineering Compiler [19]. Each feature will be introduced and developed as needed.


Once the features have been extracted, the ensemble starts an iterative matching process, during which it accumulates Φ, a common set of equivalent functions, and Γ, a common set of equivalent global data references. Both sets are initialized to the empty set and share the form {(a1, b1), (a2, b2), ...}. In Φ each ai and bi represent function addresses in binaries A and B respectively. Similarly, in Γ, ai and bi represent global data reference pairs.

Pairs are matched and added to these sets using one of our signature algorithms as follows. First, only currently unlabeled objects are considered during each iteration. We then focus on those objects that have a unique signature in the first binary and search for that signature to be present uniquely in the second binary. If found, the two items are paired and labeled, the pair is inserted into the appropriate Φ or Γ set, and the two items are removed from the pool of unlabeled objects. At the beginning of the next iteration, we reevaluate the signatures previously ignored due to their non-uniqueness: a signature that was previously non-unique in its binary may later become unique again if all but one of its objects in the same executable have since been successfully matched between the executables. The algorithm terminates after a cycle in which no new pairs are matched.
4.2 Signatures

Our ensemble classifier uses two types of signatures. Static signatures depend only on the extracted features. Dynamic signatures make use of matchings accumulated in Φ and Γ in addition to extracted features. This improves the quality of subsequent signatures generated, although at the cost of computing some signatures on the fly.

The following two lists describe our signature generation algorithms. Each algorithm will be depicted as

  Name : (arg1[, arg2, ..., argn]) → Signature.    (1)

Here Name is a short descriptive identifier. The first argument arg1 will represent both the binary and either some extracted feature of a function or the location of a global data reference in that binary. Any other required features will be passed in arg2 through argn. Each algorithm will yield Signature, a value which can be compared with others (generated by the same algorithm) in linear time based on the length of the signatures.

4.2.1 Statically constructed signatures

– Named symbol look-up: Executables generally use DLLs to invoke external functions by name, whether they are implemented in the operating system or in other application libraries. Compilers support this by generating wrapper stubs in the caller's executable. Each stub is a simple wrapper function used to call the desired target through an import table that is populated at runtime. Our first signature labels each such stub address with the name (as a string) of the function it is dynamically linked against. Since not all functions are wrapper stubs, this signature may be only partially defined. The partial signature

  DllName : (fn) → RealName    (2)

takes fn, the location of a wrapper function, and yields a text string RealName that denotes the name of the imported function.
– MD5 hash of binary function: The signature

  MD5bin : (fn) → 128-bit hash    (3)

takes as input a binary representation of the function fn and yields its 128-bit hash. MD5 was chosen due to its low probability of collisions.¹ It exhibits very high precision in determining exact matches, so there is almost no chance of getting a false positive. However, a single bit change causes an unrecoverable mismatch. Therefore, we would like a more resilient approach that gives roughly the same precision without being as sensitive to a single bit change.
– MD5 hash of opcodes from a function: As suggested in [4], we relax the MD5 hash to apply to the string form of the opcodes that comprise a function. The signature

  MD5opcode : (fn) → 128-bit hash    (4)

takes as input the sequence of opcodes that comprise the function fn (e.g., "xor", "inc", "mov", etc.) and yields its 128-bit hash. This feature is invariant to differences in register assignment, immediate values, and other operands but still represents the sequence of instructions in fn. It would seem that the opcode signature alone should be sufficient to differentiate functions. However, during our experiments we found many counterexamples to this notion. It appeared that many functions were constructed from template code, perhaps as a consequence of macro expansion. Sometimes, the only differentiator was a simple constant value, or a reference to a specific global memory location. (A sketch of this signature appears after this list.)

¹ Note that the specific MD5 collisions demonstrated in 2004 do not appear to directly affect the probability of collisions on random inputs.


– Control flow graph: Control flow graphs allow us to compare the control structure of two functions. Each function can be depicted as a directed graph whose nodes represent the basic blocks of contiguous instructions, where control only passes directly from one instruction to the next, and whose edges represent the flow of control connecting those blocks. This allows us to compare functions without being concerned about their constituent instructions.
Since detecting graph isomorphism is GI-complete [20], we relax the signature to make comparisons feasible. The signature represents the graph in a way that captures some of the edges in the graph but still allows for linear-time comparison. This representation is safe: it captures features necessary for detecting isomorphism, although it does not capture all such features.
Our specific approach follows that of [3]. Kruegel et al. constructed adjacency matrices using all subsets of nodes of a given fixed size k, and the edges that connected those nodes in the function's control flow graph. For our research we arbitrarily set k = 4. The individual rows of each 4-by-4 adjacency matrix were then concatenated together to form a 16-bit value, each of which is labeled hi in the following signature.² The list of 16-bit values constructed for each function is sorted once during the feature extraction phase. To compare two functions during a run, we simply compare the sorted lists of values for equality. This test can be done in time linear in the number of features. (This construction is also illustrated in the sketch after this list.)
The signature

  Cfg : (fn) → {h1, h2, ...}    (5)

requires fn to be the representation of the control flow graph for the given function and yields a set of features {h1, h2, ...} representing that control flow graph.

² It is important to note that many 4-by-4 adjacency matrices are isomorphic to one another. As we were constructing a 16-bit value for comparison, we needed to make sure all isomorphic adjacency matrices generated the same 16-bit value. As part of our experimental setup we constructed a 4-by-4 adjacency matrix for each possible 16-bit value. All isomorphic adjacency matrices were grouped together, and for each group a representative 16-bit value was chosen to be returned.
– Parameter count: We also count the number of parameters passed at entry and the total number of unique global variables referenced by the function. The signature

  Params : (fn) → (p, gf, gd)    (6)

yields a tuple for function fn where p is the number of parameters that fn receives, gf is the number of pointers to recognized functions that appear in fn, and gd counts how many other global data references appear in fn. Each of these values tends to be small, so the Params signature is fairly coarse.
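As promised above, here is a minimal Python sketch of the opcode hash (4) and of a Cfg-style fingerprint (5). The opcode list and CFG edge set are assumed to come from a disassembler or from the parsed REC output; for brevity, the canonicalization of isomorphic 4-by-4 adjacency matrices is done here by minimizing over node orderings rather than with the precomputed table described in footnote 2.

import hashlib
from itertools import combinations, permutations

def md5_opcode(opcodes: list) -> str:
    """Signature (4): MD5 over the opcode mnemonics only, ignoring operands."""
    return hashlib.md5(" ".join(opcodes).encode()).hexdigest()

def cfg_signature(nodes: list, edges: set, k: int = 4) -> list:
    """Signature (5): one 16-bit value per k-node subset of the CFG.

    Each k-by-k adjacency matrix is flattened row by row into a k*k-bit
    integer; isomorphic matrices receive the same value by minimizing
    over all orderings of the chosen nodes.
    """
    values = []
    for subset in combinations(nodes, k):
        best = None
        for order in permutations(subset):
            bits = 0
            for a in order:
                for b in order:
                    bits = (bits << 1) | ((a, b) in edges)
            if best is None or bits < best:
                best = bits
        values.append(best)
    return sorted(values)  # sorted once during feature extraction

Two functions then match under Cfg exactly when their sorted value lists are equal, a comparison that is linear in the number of features.
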
4.2.2 Dynamically constructed signatures

The preceding signatures do not depend on the greater context in which the functions appear. We call them static, since they only need to be computed once. The remaining signatures depend both on the function itself and on knowledge deduced about the overall executable. Since this knowledge grows as the algorithm runs, we call these signatures dynamic.

– Function call vector matchers: The function call vectors we extract allow us to create a signature for each function fn depending only on the list of functions directly invoked by fn. This signature is invariant under control flow changes and other types of code modifications. We limit this signature to enumerating the number of times an unlabeled function makes calls to functions that are already known to exist in both binaries.
The signature

  CallRefs : (fn, Φ) → {(n1, x1), ...}    (7)

takes as input fn's entry from the call vector enumerating all of the functions that fn invokes, along with the number of times they are invoked. The result is a set of tuples of the form (ni, xi), with each tuple indicating that fn calls xi in ni different places, and each xi ∈ Φ. The latter condition means that this signature only considers those functions called by fn that have already been matched. We emphasize that this tuple results from a static analysis of fn: the number ni indicates the number of different instructions in fn that call xi, rather than the number of times xi is called at run time. A simpler form of this signature was used in [6], wherein each ni is a Boolean value rather than a call count. (A sketch of this signature appears after this list.)
Likewise, the signature

  ParentRefs : (fn, Φ) → {(n1, x1), ...}    (8)

takes as input fn's entry from the call vector enumerating all of the functions that invoke fn and the number of times fn is invoked by each such function. The result is a set of tuples of the form (ni, xi), with each tuple indicating that xi calls fn in ni different places, and each xi ∈ Φ.


– Duplicate signature combiner: We define a compound signature

  JoinSig : (fn, D) → {(t1, v1), (t2, v2), ...}    (9)

where D is a set of tuples (fi, ti, vi) in which function fi has a signature of type ti with value vi. The output is those pairs (ti, vi) of signature types and values that describe fn. In our algorithm, the set D will be populated with a list of signatures that are duplicates: they are not unique in their functions' own executables. By combining duplicate signatures of different types, we may be able to construct a unique signature. For example, consider a program A, where DA = {(f1, Params, (1,1,0)), (f2, Params, (1,1,0)), (f2, Cfg, (x,y,z)), (f3, Cfg, (x,y,z))}. Note that none of the signature values in DA is unique. Then JoinSig(f1, DA) = {(Params, (1,1,0))}. However, this consists of a single signature type, and our algorithm will separately use the Params (1,1,0) signature to attempt to match functions in A and B. Therefore, the JoinSig does not contribute any new information in the attempt to match f1 to a function in B. Similarly, f3 does not benefit from JoinSig in this case. However, JoinSig(f2, DA) = {(Params, (1,1,0)), (Cfg, (x,y,z))}. Neither f1 nor f3 shares this JoinSig; it is therefore unique and may be used to search for a matching function in B.
– Global data reference matcher: This new signature makes use of the global data references contained in the binary program. The signature

  GdrMatcher : (gdr, Φ) → {x1, x2, ...}    (10)

takes a global data reference gdr and Φ, and yields a set of functions where each xi ∈ Φ contains a reference to gdr. If the signature value is unique within a program, then its global data reference gdr is uniquely shared among the named subgroup of functions. Also, this subgroup of functions is known to exist in both binaries, because each xi ∈ Φ represents a matched function. If some global data reference from the other binary shares this unique signature, then our algorithm will equate them. When two variables are equated during the matching process, we insert the pair into a common set of shared variables Γ, just as functions are placed into Φ. (This signature is also rendered in the sketch after this list.)
– All labeled locations matchers: These two signatures for matching functions make use of knowledge contained in both Φ and Γ. The signature

  AllLocalIds : (fn, Φ, Γ) → ({x1, ...}, {y1, ...})    (11)

takes as input a function fn, Φ, and Γ and yields a pair of sets. The xi ∈ Φ are the previously matched functions that fn calls, and the yi ∈ Γ are the previously matched global data references that fn refers to.
Likewise, the signature

  AllParentIds : (fn, Φ, Γ) → ({x1, ...}, {y1, ...})    (12)

takes as input a function fn, Φ, and Γ and yields a pair of sets. The xi ∈ Φ are the previously matched functions that call fn, and the yi ∈ Γ are the previously matched global data references that are referred to by some xi.
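As referenced in the items above, the following sketch renders CallRefs (7) and GdrMatcher (10) in Python under assumed data structures: calls maps a function address to a Counter of its call-site targets, refs maps a global data reference to the set of functions that mention it, and phi maps each already-matched local address to its Φ pair id so that signature values are comparable across the two binaries.

from collections import Counter

def call_refs(fn, calls: dict, phi: dict) -> frozenset:
    """Signature (7): (count, callee) tuples, restricted to callees in Φ.

    The counts come from static analysis: the number of call sites in fn
    that target each callee, not the number of run-time invocations.
    """
    return frozenset((n, phi[x])
                     for x, n in calls.get(fn, Counter()).items()
                     if x in phi)

def gdr_matcher(gdr, refs: dict, phi: dict) -> frozenset:
    """Signature (10): the Φ pair ids of the matched functions that refer to gdr."""
    return frozenset(phi[x] for x in refs.get(gdr, set()) if x in phi)

ParentRefs, AllLocalIds, and AllParentIds follow the same pattern, using the caller relation and Γ.
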


4.3 Ensemble classifier algorithm

The ensemble classifier algorithm maintains two sets for each binary: a set of unlabeled functions and a set of unlabeled global data references. The function StaticSig(S, ItemSet) takes a signature-generating function S to invoke and a set ItemSet of unlabeled items of the type S takes as input. It yields a pair of sets (Uniqs, Dups), where Uniqs = {(x1, Sig1), (x2, Sig2), ...} with each item xi ∈ ItemSet, S(xi) = Sigi, and no Sigi repeated in the list. The set Dups = {(Sig1, x1, x2, ...), ...}, where each signature Sigi is associated with at least two items from ItemSet. Every item in ItemSet appears either in Uniqs or in Dups.

The function DynamicSig(S, Af, Φ, Γ) works the same way as StaticSig except that it takes as arguments Φ and Γ, which as stated earlier are the sets of functions and globals, respectively, that have been deemed equivalent between the two programs. Dynamic signatures use information previously accumulated in Φ and Γ. Therefore, if Φ and Γ are empty then no new signatures will be generated by DynamicSig.

The function EquateEquivs(UniqsA, UniqsB, ItemSetA, ItemSetB, LabeledSet) processes Uniqs sets. Any Sigi found in both UniqsA and UniqsB describes an item that is considered equivalent in the two binaries. The items paired with Sigi in their corresponding Uniqs set are removed from their respective ItemSets and are paired together and appended to LabeledSet: specifically, function pairs are appended to Φ and global data reference pairs to Γ.

Algorithm 1 Ensemble classifier
Require: Af and Bf to be unlabeled function locations; Ag and Bg to be unlabeled global data reference locations.
  Φ = {}
  Γ = {}
  (UA, DAdlls) = StaticSig(DllName, Af)
  (UB, DBdlls) = StaticSig(DllName, Bf)
  EquateEquivs(UA, UB, Af, Bf, Φ)
  repeat
    for S ∈ {MD5bin, MD5opcode, Cfg, Params} do
      (UA, DAS) = StaticSig(S, Af)
      (UB, DBS) = StaticSig(S, Bf)
      EquateEquivs(UA, UB, Af, Bf, Φ)
    end for
    for S ∈ {CallRefs, ParentRefs, AllLocalIds, AllParentIds} do
      (UA, DAS) = DynamicSig(S, Af, Φ, Γ)
      (UB, DBS) = DynamicSig(S, Bf, Φ, Γ)
      EquateEquivs(UA, UB, Af, Bf, Φ)
    end for
    DA = ⋃S DAS
    DB = ⋃S DBS
    (UA, UB) = JoinSigsOfDups(DA, DB)
    EquateEquivs(UA, UB, Af, Bf, Φ)
    UA = DynamicSig(GdrMatcher, Ag, Φ, Γ)
    UB = DynamicSig(GdrMatcher, Bg, Φ, Γ)
    EquateEquivs(UA, UB, Ag, Bg, Γ)
  until Φ stops growing
  return Φ, Γ

The algorithm starts by using DllName (see (2)) to identify common imported functions. As all imported functions are guaranteed to be unique, this is done only once. It then enters a repeat/until loop which iterates until no new function pairs are appended to Φ. At the start of each iteration, the algorithm executes each of the static signature classifiers to equate functions. Although the static signatures themselves do not change, their classification as unique or duplicate may change, because the sets Af and Bf shrink as functions are equated. The algorithm then equates functions using our dynamic signatures.

After all function-based signatures are used individually, we call JoinSigsOfDups(DA, DB), where DA and DB are the unions of all the Dups sets generated by the individual signature algorithms. This function uses the JoinSig (see (9)) and the supplied Dups sets to identify those JoinSigs that are unique among the JoinSigs in their own binary. These unique signatures are then used to match functions, if possible.

Finally, after all signatures primarily based on functions have been used, we use that information in GdrMatcher to correlate global data references into Γ. After all global data references have been correlated, the iteration is complete. If Φ had no new functions appended during this iteration, the loop is terminated.

Each iteration removes at least one function from Af and Bf. Therefore, the maximum number of iterations for our ensemble is bounded by the number of unlabeled functions in the input programs.
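The bookkeeping of Algorithm 1 can be summarized in Python. The sketch below is a simplified rendering, not the authors' implementation: each signature is a callable returning a comparable value (or None where undefined), dynamic signatures are assumed to close over Φ and Γ in pair-id form, and the JoinSigsOfDups and GdrMatcher passes are elided.

def static_sig(sig, items):
    """Group unlabeled items by signature value; split uniques from duplicates."""
    groups = {}
    for x in items:
        s = sig(x)
        if s is not None:
            groups.setdefault(s, []).append(x)
    uniqs = {s: xs[0] for s, xs in groups.items() if len(xs) == 1}
    dups = {s: xs for s, xs in groups.items() if len(xs) > 1}
    return uniqs, dups

def equate_equivs(uniqs_a, uniqs_b, items_a, items_b, labeled):
    """Pair items whose signature is unique in both binaries."""
    for s in uniqs_a.keys() & uniqs_b.keys():
        a, b = uniqs_a[s], uniqs_b[s]
        items_a.discard(a)
        items_b.discard(b)
        labeled.append((a, b))

def ensemble(funcs_a: set, funcs_b: set, signatures):
    """Iterate until a full pass adds no new pairs to phi."""
    phi = []
    while True:
        before = len(phi)
        for sig in signatures:
            ua, _ = static_sig(sig, funcs_a)
            ub, _ = static_sig(sig, funcs_b)
            equate_equivs(ua, ub, funcs_a, funcs_b, phi)
        if len(phi) == before:
            return phi
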
4.4 Measure of similarity

Upon termination the ensemble has computed a set of function matches Φ and data reference matches Γ for the given pair of binaries. To represent the correlation of the binaries we use the measure of similarity between two programs developed in [6]. In computing this measure, let programs A and B have a total of i and j compiled functions respectively, and let n = |Φ|, the number of functions deemed equivalent between the two programs. The fraction n/i is in the interval [0, 1] and relates what percentage of the functions in program A are in the common set. Similarly, n/j relates information pertaining to program B. The metric

  Sim_Φ = n² / (ij)    (13)

will also be in the interval [0, 1] and will be used to relate the overall percentage of functions in the common set.

In a corresponding way we let n = |Γ| and i, j represent the global data references in the two programs. The same computation with n, i, j will produce Sim_Γ. We use this to relate the percentage of global data references in the common set.
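As a worked example of (13) with hypothetical counts: if program A has i = 200 functions, program B has j = 150, and n = 100 pairs are matched, then Sim_Φ = 100²/(200 · 150) ≈ 0.33. In code:

def similarity(n: int, i: int, j: int) -> float:
    """Equation (13): equals the product of the coverage fractions (n/i)*(n/j)."""
    return (n * n) / (i * j)

print(similarity(100, 200, 150))  # 0.333..., the hypothetical example above
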


5 Experiment

5.1 Data set

We acquired the malware data set from [21] to conduct our experiment. There were 4200 programs that were unpackable with RAR [22], UPX [23], or that required no unpacking and appeared to be non-obfuscated. For the experiment we selected the 638 programs that seemed UPX-unpackable and unpacked them for analysis. The name-based taxonomy included with these binaries indicated that some were already considered variants of each other while others were not known to be related.

5.2 Performance model

To show that our new signatures (10), (11), and (12) offer improvement, we need an improvement measure other than a statistical test for accuracy. This is simply because no such objective accuracy test exists.

The data set does contain labels indicating that certain binaries are considered variants of each other. However, malware classification and taxonomy construction is an open problem [24]. We cannot assume that the labeling in our input data set is complete or even consistent. There is evidence that malware authors reuse code, so if we discover more matches among code than are reported in the provided labels, we should not necessarily conclude that our matches are incorrect. And although malware authors may rewrite code merely in order to disguise it, they may also rewrite code in order to change its functionality. In that situation, decreasing the computed numerical correlation between two generations of the same program is appropriate. Therefore, there is currently no objective way to prove that one classifier ensemble is more accurate than another simply because it produces more (or fewer) matches. More generally, there are no universally accepted quantifiable software engineering metrics for measuring changes in binary code. Thus, developing a statistical test for improvement of accuracy is not practical when comparing classifiers on sets of compiled programs.

Instead of making a case for the improvement of accuracy, we demonstrate that by including our new signatures we have created a new classifier that should be used in place of another one. If our new classifier produces what we think is a rational behavior at least as often as the original version, following [25] we will say that the new classifier is dominant for this classification task.

Given the following criteria, we claim our new ensemble dominates the original. Let s1 and s2 be programs, LabeledVariants be a function that determines whether they are labeled as variants in the input data set, and Matches(classifier) be the number of functions matched by classifier when comparing s1 and s2. Here classifier is either C1, the old ensemble of signatures (2-9), or C2, which also includes our new signatures (10-12). The function VarsFound returns true when C2 finds that s1 and s2 share at least one variable. In (14) we simply accumulate votes for which classifier does a better job based on the available information. The classifier that accumulates the higher total of votes will be considered the dominant classifier.
nine pairings was due to the new classifier matching twelve
dom(s1, s2) =
  C1 if not LabeledVariants(s1, s2) and Matches(C1) < Matches(C2)
  C1 if LabeledVariants(s1, s2) and Matches(C1) > Matches(C2)
  C1 if not LabeledVariants(s1, s2) and Matches(C1) = Matches(C2) and VarsFound(s1, s2)
  C1 if LabeledVariants(s1, s2) and Matches(C1) = Matches(C2) and not VarsFound(s1, s2)
  C2 otherwise    (14)
shared data references.
For the remainder of the paper we will say that the classifier without our new signatures is the original ensemble classifier, and the new ensemble classifier contains our new signatures.

5.3 Similarity results

Analysis of the 638 programs resulted in 203203 pairwise program comparisons, each of which resulted in a vote for a classifier as defined in (14). This number includes 800 program pairs labeled as variants in the input data set and the remaining 202403 not designated variants. This labeling was done by the maintainers of the malware repository [21]. The experiment resulted in a total of 149590 (74%) votes in favor of our new classifier. Note that the overwhelming majority of pairs (99.6%) are not labeled as variants. When comparing such pairs, equation (14) results in a vote cast for the old classifier if the new classifier matches more functions than the old one does. Thus this result shows that on 74% of all pairs, the new classifier is not finding any more function matches than the old one did.

Of those pairs that were labeled variants in the input, there were 610 (76%) votes for the new classifier. This means that the new classifier was either able to match more functions, or it matched the same number and was also able to match a global variable.

Together, these results indicate that the new classifier is dominant for the classification task on this data set.

We also examined the individual changes in matchings that come from incorporating global data references. Figure 1 depicts the operation of the original ensemble in relation to the new ensemble. Each point in the graph represents a comparison of two programs that are labeled as variants in our data set. The x-coordinate is the percentage Sim_Φ for the original ensemble. The y-coordinate is the percentage Sim_Φ for our new ensemble.

[Fig. 1 Function matches in the original ensemble compared to the new ensemble on binaries that are labeled variants. Axes: percentage of functions deemed equivalent by the pre-existing classifier (x) versus by the improved classifier (y).]

In the 800 pairings of programs labeled variants, sixty-eight of them resulted in a higher Sim_Φ value and nine pairings resulted in a lower value. The decrease in Sim_Φ in these nine pairings was due to the new classifier matching twelve fewer function pairs and constructing one different function pairing. We manually reviewed these twelve function pairs. In all but one case our manual review agreed there was reason to remove each of the matchings; in the manual review of the one different pairing, we agreed with the new ensemble.

We expected that programs that had been previously labeled as variants of each other would have generally higher degrees of similarity than unrelated ones. We believed this would allow for a number of variables identified in one version to be transferred to another. We took every pair of programs labeled variants and plotted Sim_Φ as a percent versus the percentage of variables named in Fig. 2. As programs share more functions in common, we are able to detect more shared data references.

[Fig. 2 Named functions versus named global data references. Axes: percentage of functions deemed equivalent (x) versus percentage of global references deemed equivalent (y).]

In general, we are able to correlate about 9% of global data references as computed by Sim_Γ when two programs labeled variants share 50% code as computed by Sim_Φ. The question arises: what is the quality of the matches made by our ensemble? We picked global data reference matches generated by our ensemble at random and reviewed the code manually. Program pairings that shared less than 25% code had many variable matches in which our manual review disagreed with our ensemble. But when programs exhibit function similarity metrics above 50%, the variables matched generally seemed to have similar usage roles in the two executables.


Thus it appears that global variable matching is better used for associating variables in executables that are suspected to be related, and for boosting confidence in that assessment, than as a primary means for determining whether two executables are in fact variants of each other. We revisit this question in Sect. 5.4.

We next compare a subset of our results with those generated by Carrera et al. [6]. Carrera had access to a different subset of samples than we did. Carrera split his results into two tables, one for variants A, B, C, D, E, and F, and another for variants H, I, J, L, and M. For space reasons we show only the latter in Table 1; results for the former are similar. Each filled entry represents the Sim_Φ × 100 value generated by (C)arrera on the left, our (O)riginal ensemble in the middle, and our (N)ew ensemble on the right. This table shows that our results are generally consistent with Carrera's.

Table 1 Carrera's and our results for Email-Worm.Win32.Mimail samples H, I, J, L, and M

      I            J            L            M
      C   O   N    C   O   N    C   O   N    C   O   N
H     82  84  84   83  84  84   91  94  94   89  96  96
I                  95  98  98   82  79  81   80  85  85
J                               83  79  81   81  85  85
L                                            90  98  98

The most interesting result was the discovery of programs which, while unrelated by the taxonomy, still seemed to share a large number of functions and global data references. No analysis has been conducted yet to determine the nature of the units of code shared among these executables. However, both the percentage and sheer number of matched functions between the executables strongly suggest that the VxHeavens names for these programs do not adequately reflect their shared ancestry. Furthermore, the matched data references provide important points of reference for an analyst trying to discern the shared behaviors of the matched programs.

Table 2 shows our top five novel discoveries of this code sharing. Columns one and two are the names of the programs, columns three and four show the number and percentage of functions shared according to Sim_Φ, and finally columns five and six show the number and percentage of the total data references shared according to Sim_Γ.

Table 2 Top five programs thought unrelated by name, with the number and percentage of the functions and global data references shared

                                               Fns        Data
Program A              Program B               #    %     #   %
Trojan-Spy.Banker.di   Trojan-Spy.Coiboa.b     520  89    28  19
Trojan.ZomJoiner.01.a  Trojan-PSW.Zombie.12    493  88    82  17
Backdoor.Fluxay.0473   HackTool.SmbCrack.4     560  74    78  8
Backdoor.DarkSky.b     Trojan.ZomJoiner.01.a   479  76    56  7
Backdoor.DarkSky.b     Trojan-PSW.M2.147       490  71    81  14

5.4 Using global data references as a discriminator

We also considered using global data references to predict whether two binaries are variants. For this experiment, we regarded the input labels of executables as ground truth and attempted to determine whether two executables are variants of each other. If we detected at least one shared global data reference, then we reported that the two programs are variants; otherwise we reported that they were unrelated.

From the votes tabulated we constructed five standard statistical measures, shown in Table 3. Precision indicates what percentage of the examples classified as variants actually are variants. Recall shows what percentage of actual variants are correctly classified. Accuracy shows the percentage of correct classification decisions. False positives show the percentage of negative examples incorrectly classified as positives. False negatives show the percentage of positive examples incorrectly classified as negatives.

Table 3 Statistics generated from votes cast

Total pairs         203203
Number of variants  800
Precision           1%
Recall              77%
Accuracy            73%
False negatives     22%
False positives     26%
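For reference, a minimal sketch of how the five measures in Table 3 derive from a standard confusion matrix; tp, fp, tn, and fn are the tallies of true/false positives and negatives over all pairs.

def measures(tp: int, fp: int, tn: int, fn: int) -> dict:
    """The five statistics of Table 3 from confusion-matrix counts."""
    return {
        "precision": tp / (tp + fp),            # predicted variants that are variants
        "recall": tp / (tp + fn),               # actual variants that are detected
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "false positive rate": fp / (fp + tn),  # non-variants flagged as variants
        "false negative rate": fn / (fn + tp),  # variants that are missed
    }
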


The good recall and accuracy suggest that our new ensemble makes reasonable assessments about whether two programs may be variants. More analysis needs to be done to understand the nature of the false negatives; it may simply be that those variants do not share any correlatable data references. The low precision and high false positive percentages are not surprising: we found evidence of code sharing among programs not designated as variants. We believe that the results generated in Table 2 are not spurious, and instead reveal commonality not designated by the current taxonomy.
6 Conclusion and future work

We have shown that several weak classifiers can be combined to build a stronger ensemble classifier for use in reverse engineering. The ensemble we presented can be used to correlate global data references between binaries. Among binaries known to be related, we can correlate roughly 9% of the total data references using Sim_Γ as our measure.

Thus, we believe we can answer yes to both questions posed in the introduction: correlating data references does lead to better function correlations, and correlated data references do usefully predict whether two binaries are related.

There is no standard quantitative measure for comparing binaries. As stated earlier, we incorporated the similarity metric used by [6] to produce an overall correlation metric between two programs. While the quantitative metrics of Sim_Γ and Sim_Φ do not adequately describe the correlations found, they at least allow us a standard measure for comparison. A robust quantitative metric for comparing binaries would allow researchers to compare results in a standardized way. We believe quantitative change metrics are an important area for future investigation.

We stated that when executables shared 50% of a common code base, most data references matched were almost always corroborated by the manual review, whereas with fewer than 25% sharing the matches were often error prone. We plan to conduct a more thorough review to plot the decline in the quality of matches with the decrease in the shared code base. Such a curve would potentially allow us to know how much confidence can be placed in the matches generated.

The modular construction of this ensemble classifier allows for new signatures to be added quickly. We plan on extending our data reference signatures. We plan to correlate data references based on how they are accessed in functions. And we plan on incorporating the roles of variables [26] into our framework as fine-grained signatures on global data references. Lastly, except for the MD5 checksum signatures, this type of analysis is independent of microprocessor, so another line of analysis would be to analyze variants compiled for different microprocessors.

Acknowledgments We thank Dr. Richard Greene and Dr. Desiree Beck of The MITRE Corporation, and Dr. Karen Daniels of The University of Massachusetts at Lowell for taking an interest in this research and providing valuable insights. We also thank Stacey Arnold as well as the anonymous reviewers for their time and thoughtful suggestions.

References

1. Rozinov, K.: Reverse code engineering: an in-depth analysis of the Bagle virus. In: Proceedings of the Sixth Annual IEEE Systems, Man and Cybernetics (SMC) Information Assurance Workshop, pp. 380-387 (2005)
2. Gordon, J.: Lessons from virus developers: the Beagle worm history through April 24 (2004)
3. Kruegel, C., Kirda, E., Mutz, D., Robertson, W., Vigna, G.: Polymorphic worm detection using structural information of executables. In: Recent Advances in Intrusion Detection (RAID), pp. 207-226 (2005)
4. Schulman, A.: Finding binary clones with opstrings and function digests. Dr. Dobb's Journal 374, 69-73; 375, 56-61; 376, 64-70 (2005)
5. Dullien, T., Rolles, R.: Graph-based comparison of executable objects. In: Proceedings of the Symposium sur la Sécurité des Technologies de l'Information et des Communications (SSTIC), pp. 421-433 (2005). http://www.sstic.org/
6. Carrera, E., Erdélyi, G.: Digital genome mapping - advanced binary malware analysis. In: Virus Bulletin Conference, pp. 187-197 (2004)
7. Flake, H.: Structural comparison of executable objects. In: Proceedings of the Conference on Detection of Intrusions and Malware & Vulnerability Assessment, pp. 161-174 (2004)
8. Sabin, T.: Comparing binaries using BindView. Technical report, Sabre (2004)
9. Karim, E., Walenstein, A., Lakhotia, A., Parida, L.: Malware phylogeny generation using permutations of code. J. Comput. Virol. 1(1-2), 13-23 (2005)
10. Filiol, E.: Malware pattern scanning schemes secure against black-box analysis. J. Comput. Virol. 2(1), 35-50 (2006). EICAR 2006 Special Issue
11. Spinellis, D.: Reliable identification of bounded-length viruses is NP-complete. IEEE Transactions on Information Theory, pp. 280-284 (2003)
12. Matching global data references in related executables (2007)
13. Newsome, J., Karp, B., Song, D.: Polygraph: automatically generating signatures for polymorphic worms. In: SP'05: Proceedings of the 2005 IEEE Symposium on Security and Privacy, pp. 226-241. IEEE Computer Society, Washington, DC, USA (2005)
14. Linn, C., Debray, S.: Obfuscation of executable code to improve resistance to static disassembly. In: Proceedings of the 10th ACM Conference on Computer and Communications Security, pp. 290-299 (2003)
15. Wehner, S.: Analyzing worms and network traffic using compression. J. Comput. Secur. 15(3), 303-320 (2007)
16. Li, M., Vitányi, P.: An Introduction to Kolmogorov Complexity and Its Applications. Springer, Berlin (1997)
17. Freund, Y., Schapire, R.E.: Experiments with a new boosting algorithm. In: International Conference on Machine Learning, pp. 148-156 (1996)
18. Topchy, A., Jain, A.K., Punch, W.: Clustering ensembles: models of consensus and weak partitions. IEEE Trans. Pattern Anal. Mach. Intell. 27(12), 1866-1881 (2005)
19. Caprino, G.: REC Reverse Engineering Compiler, Version 1.6 (2000)
20. Torán, J.: On the hardness of graph isomorphism. SIAM J. Comput. 33(5), 1093-1108 (2004)


21. VX Heavens website (2006)
22. RARLab: RAR Compression Homepage (2006)
23. UPX: UPX Homepage (2007)
24. Filiol, E., Helenius, M., Zanero, S.: Open problems in computer virology. J. Comput. Virol. 1(3-4), 55-66 (2006)
25. Provost, F., Fawcett, T., Kohavi, R.: The case against accuracy estimation for comparing induction algorithms. In: Proceedings of the Fifteenth International Conference on Machine Learning, pp. 445-453 (1998)
26. Kuittinen, M., Sajaniemi, J.: Teaching roles of variables in elementary programming courses. In: ITiCSE'04: Proceedings of the 9th Annual SIGCSE Conference on Innovation and Technology in Computer Science Education, pp. 57-61. ACM Press, New York (2004)

