Automatic Classification of Object Code Using Machine Learning
By
John Clemens
DFRWS is dedicated to the sharing of knowledge and ideas about digital forensics
research. Ever since it organized the first open workshop devoted to digital forensics
in 2001, DFRWS has continued to bring academics and practitioners together in an
informal environment.
As a non-profit, volunteer organization, DFRWS sponsors technical working groups,
annual conferences and challenges to help drive the direction of research and
development.
http://dfrws.org
Digital Investigation 14 (2015) S156–S162
Digital Investigation
journal homepage: www.elsevier.com/locate/diin
DFRWS 2015 US
Abstract

Keywords: Machine learning; Classification; Computer architecture; Malware analysis; Object code

Recent research has repeatedly shown that machine learning techniques can be applied to either whole files or file fragments to classify them for analysis. We build upon these techniques to show that for samples of un-labeled compiled computer object code, one can apply the same type of analysis to classify important aspects of the code, such as its target architecture and endianess. We show that using simple byte-value histograms we retain enough information about the opcodes within a sample to classify the target architecture with high accuracy, and then discuss heuristic-based features that exploit information within the operands to determine endianess. We introduce a dataset with over 16,000 code samples from 20 architectures and experimentally show that by using our features, classifiers can achieve very high accuracy with relatively small sample sizes.
© 2015 The Authors. Published by Elsevier Ltd on behalf of DFRWS. This is an open access
article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
http://dx.doi.org/10.1016/j.diin.2015.05.007
classifier that relies solely on the object code itself, ignoring any meta-data that may (or may not) be present. Secondly, the analysis from most previous work stops one level above what we believe is possible. These systems will identify a sample as containing object code, but won't give any more information than a general file label. When possible, we should label the sample with information about the type of object code the sample contains.

We propose methods that apply machine learning techniques to automatically classify an object code sample with its target architecture and endianess. Such a system automates the first phase of object code analysis, allowing the analyst to jump directly to decoding the instructions and determining intent.

The rest of this paper is structured as follows: the next sub-section discusses related research. In the Hypothesis section we attempt to formalize the problem of architecture and endianess classification. Next we discuss the intuition behind our proposed solutions, and then go over our experimental design and results. We conclude with a discussion of the results and potential follow-on work.

Related research

Many systems exist to determine the type of binary code a file may contain. The simplest systems rely solely on the file name or file extension. However, most systems rely on the contents of a "file header" at a known location within the file (normally at the beginning) which includes meta-data about what type of file it is, such as a document, picture, or executable. The UNIX file command uses a database of "magic" values at known offsets within the file to classify the file type. In the case of executables or other object code, these file type (ELF, PE, etc.) headers contain fields with information such as the target architecture, word size, and endianess. Each of these systems uses some form of meta-data (file header, signature, or filename) that may not be available to an analyst.

McDaniel and Heydari (2003) were among the first to propose using characteristics derived from the contents of an entire file to do classification. They used byte-value histograms as one of their representations and performed statistical analysis to classify files. This inspired many more researchers to use other methods, including n-gram analysis and SVMs, to tackle the same problem. Examples include Fitzgerald et al. (2012), Li et al. (2010, 2005), and Xie et al. (2013). Beebe et al. (2013) produced the Sceadan tool, which builds upon much of this earlier work. This line of research has concentrated on differentiating diverse file types from each other.

Relating specifically to architecture classification, Chernov and Troshina (2012) attempt to automate the analysis of custom virtual machines used by malware. Their system uses opcode frequency counts as part of their analysis system to help defeat code obfuscation within the custom virtual machine. Similarly, Rad et al. (2012) show that opcode frequency counts can be used to find mutated forms of the same malware. They rely on knowledge of the underlying physical system's opcodes as an indicator of program similarity.

Sickendick (2013) describes a system for firmware disassembly including file carving and architecture detection using machine learning. For architecture detection, he adapts the method of Kolter and Maloof (2006), used for malware detection. The information gain for each byte-value 4-gram in the training set is calculated, and the top 500 4-grams are used as a feature vector for a Decision Tree and an SVM classifier. This work is limited to four architectures common to SCADA devices and makes no attempt to classify different endianess within the same architecture.

Binwalk (Heffner, 2010) is a popular firmware analysis tool that includes two techniques to identify object code. When run with the '-A' option, Binwalk looks for architecture-specific signatures indicative of object code. Currently, Binwalk's architecture signature detection includes 33 signatures from 9 different architectures. However, Binwalk simply reports every place it finds a signature and leaves it up to the user to make a classification decision based upon that information. Binwalk also includes a '-Y' option which will attempt to disassemble code fragments using the Capstone (Anh, 2014) disassembly framework configured for multiple architectures. Binwalk currently supports 9 configurations of 4 unique architectures for disassembly. Notably, both methods can potentially indicate endianess as well as architecture.

Binwalk's methods are effective in a wide variety of use cases, but are not without their limitations. Signature-based methods can lead to false positives if the byte signatures are not unique when compared to other architectures. Evidence of such collisions exists in the Binwalk code itself, where a comment mentions that some 16-bit MIPS code signatures are often detected in ARM Thumb code. Disassembly of a fragment can also cause issues. There is at least one case (i386 versus x86_64) where both architectures could disassemble the same fragment of code without error. Both techniques rely on previous knowledge of the architecture, and in the case of active disassembly, complete knowledge and support in a disassembler framework. The technique presented in this paper takes a more holistic approach, and is able to classify architectures, both virtual and physical, for which there are samples, even if information about the architecture is incomplete.

Problem

We aim to automatically classify two characteristics of computer object code:

• Architecture: The unique encoding of the computer's instructions.
• Endianess: The way the code expects multi-byte data to be ordered when in memory.

Computer object code consists of a stream of machine instructions encoded as a string of bytes. The instruction stream is loaded into memory and stored in the native endianess of the processor. The processor fetches instructions from the instruction stream in memory, and then decodes and executes them. Computers share the same architecture if they use the same (or similar) encodings for
these machine instructions. The encoding of the instructions is referred to as an instruction set. Some architectures define fixed-length instruction encodings while others define variable-length instruction encodings. This makes it impossible to determine the boundaries of instructions within an instruction stream without knowing the target architecture.

Machine instructions consist of two parts: the opcode specifies which instruction the processor is to execute, and the operands specify what data (or pointers to data) the instruction applies to. Opcodes are the byte representation of the instruction and are specified by the architecture. Operands can be many things, including encoded register values, memory locations, and direct data values. While opcode encodings are unique to a specific architecture, operands vary with the data and flow of the particular program. To accurately classify the architecture, one should isolate its opcodes.

Endianess refers to the way the architecture stores multi-byte data in memory. There are two ways multi-byte values may be encoded: least significant byte first (little endian) or most significant byte first (big endian).¹ Most architectures define an endianess, so knowing the architecture automatically infers the endianess. However, some architectures (e.g. MIPS, ARM, Power) can be configured to use either endianess at runtime, and thus a proper classification must also determine the endianess of a sample for those architectures.

Since endianess deals with the layout of data in memory, it is difficult to determine from a sample of object code alone. However, operands may contain immediate values and/or address values which are encoded in the native endianess of the architecture when stored in memory or on disk. Any system that classifies endianess from an instruction stream may be able to extract that information from the portion of the object code used for operands.

Hypothesis

Previous research (McDaniel and Heydari, 2003) has shown that byte-value histograms over an entire file can be useful when classifying a file's type. We propose to apply this same basic technique to the object code embedded within a sample. We deliberately ignore the rest of the file as it may contain meta-data that is either not present or not trustworthy within a given scenario.

Examples from some known architecture encodings give us reason to believe that a byte-value histogram will be useful for classification. The 'amd64' architecture is a 64-bit extension of the 'i386' architecture, and uses a special "prefix" byte for every instruction that uses 64-bit operands. This byte has the high 4-bit nibble set to b'0100' and the lower four bits change depending on the rest of the instruction. One would expect a byte-value histogram for a sample from the amd64 architecture to contain many values that start with '0x4'. ARM instruction encoding specifies that the upper 4 bits of each instruction start with 'condition codes'. For most instructions, these are set to b'1110', which means 'always execute'. Therefore, one would expect a byte-value histogram for ARM systems to contain many values that start with '0xE'. Intuitively, a machine learning algorithm should be able to accurately classify between these two architectures based solely on a byte-value histogram.

More generally, in order for a byte-value histogram to be useful for classifying object code, the uniqueness of the architecture's opcodes must be preserved within the histogram. To demonstrate this is possible, we need an estimation of how likely an opcode is to influence each byte within the code section. We call this the opcode density of the architecture, and it is calculated by the formula:

    Opcode Density = length of opcode / average instruction length

For fixed-length instruction set architectures, the instruction length is fixed (normally 32 or 64 bits depending on the architecture's word size), and the opcode takes up between 6 and 12 bits, depending on the instruction. To use MIPS as an example, the instruction length is 4 bytes, and the opcode is 6 bits long, for an opcode density of approximately 19%. Practically, this means the first byte of every instruction (one in four bytes) will have the opcode encoded in its top 6 bits, heavily influencing its value. Similar analysis can be carried out for the SPARC and Alpha architectures, where the opcode is encoded in 8 bits, and ARM (8-bit opcodes + 4-bit condition codes). Even if we assume that the operands in the object code are random values, one can see that for fixed-length instruction encodings one in four byte-values within the object code will be heavily influenced by the opcode value.

For variable-length instruction sets the analysis is more difficult, as we no longer know the ratio of opcodes to total instruction length. Intel i386 opcodes have a minimum length of one byte (but can be two or more). Blem et al. (2013) show that on average, the i386 architecture for general desktop workloads has an instruction length of 3.4 bytes. This means that even if we assume one-byte opcodes, our opcode density is approximately 30%, or at the very least it is higher than most fixed-length instruction encodings for a typical workload.

These rough calculations give us some confidence that a byte-value histogram can preserve information about the opcode encoding, and thus can be used for architecture classification.

Endianess

Unfortunately, determining endianess is impossible with a byte-value histogram alone. Determining endianess requires byte adjacency information, and adjacency information is lost in the conversion to the histogram. Therefore, in order to determine endianess, we need another set of features that can preserve byte ordering information.
¹ There is also "mixed endian", but that is no longer in wide use and not considered for this analysis.

One approach would be to generate a 2-byte-value (bi-gram) histogram. While this may encode adjacency information, it would explode our feature space from 256 dimensions to 65,536, adding a large amount of
computational complexity. Also, despite the intuition, our experiments show that this approach is not useful for determining endianess.

In the previous analysis we treated the operands for a sample as random noise. While convenient for that analysis, at least some instructions encode 'immediate' data within their operands. These operands are stored in the object code in native-endian format. We aim to exploit this information to determine endianess using a small set of heuristics.

On machines without an increment instruction, one common operation when incrementing by a small value is to use an add instruction with an immediate operand of 1. On big endian machines, one is encoded in 32 bits as 0x00000001, while on little endian machines it is encoded as 0x01000000. This provides us with a heuristic: if we scan the object code for the 2-byte strings '0x0100' and '0x0001', then the latter should occur more often in little endian samples and the former should occur more often in big endian samples. This could be repeated for other small values. Another common type of immediate value encoded in operands is an address. Some addresses, typically for stack values, are high up in the address space and start with values like 0xfffe. Again, these addresses are stored differently on big endian versus little endian machines, and a scan for both values 0xfffe and 0xfeff can be used as another indicator of endianess.

We propose to use these four heuristically derived 2-byte frequency counts ('0xfffe', '0xfeff', '0x0001', '0x0100') as four new "endian" features to augment the byte-value histogram, as shown in Fig. 1. We demonstrate that these features add the ability to predict endianess with minimal computational overhead.

Experiments

…of 8-bit micro-controllers as well as CUDA samples that target the nVidia line of GPUs. All sample files in this data set are ELF files, and object code is identified by using the PyBFD (Russ and Muniz, 2013) library to parse ELF section information.

A summary of the resulting dataset with samples from 20 different architectures is shown in Table 1. Of particular interest to endianess classification is the inclusion of 'mips' and 'mipsel' as two different classes. As both classes use the exact same opcodes, the only difference between the samples is the endianess of values within their operands.

As with all datasets, this one could be improved. All samples except the CUDA samples are compiled with GCC. A different compiler might use a different mix of opcodes and thus have a different signature. Additionally, there are many more 8 and 16-bit architectures than what are represented here. We hope to augment this dataset over time to add more diversity among the samples.

Feature generation

As described above, we will use a feature vector that contains a byte-value histogram of the code section augmented with four additional counts of specific values we will look for to indicate endianess. The layout of the feature vector is shown in Fig. 1.

When preparing the samples, we can choose to have one feature vector per sample file, or we can choose to extract the code from each file into one big pool and draw equal-sized samples from the global pool. The latter approach might be beneficial to avoid an issue where an individual file's code sections are tiny, and thus have mostly zero values in their histograms. However, the approach of one-sample-per-file is a more realistic scenario in the field. For this paper, one feature vector is generated per sample file.
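The 260-entry feature vector described above — a normalized 256-bin byte-value histogram followed by the four normalized endianess probe counts — can be sketched in a few lines of Python. This is an illustrative reconstruction, not the author's tooling; the function name, the overlapping 2-byte scan, and the ordering of the four probe counts are our own assumptions.

```python
from collections import Counter

# The four 2-byte endianess probe values proposed in the paper.
ENDIAN_PROBES = [b"\xff\xfe", b"\xfe\xff", b"\x00\x01", b"\x01\x00"]

def feature_vector(code: bytes) -> list[float]:
    """260 features: 256 normalized byte-value bins + 4 endian counts."""
    n = len(code)
    if n == 0:
        return [0.0] * 260
    counts = Counter(code)                               # byte-value frequencies
    hist = [counts.get(b, 0) / n for b in range(256)]    # normalize by code size
    endian = []
    for probe in ENDIAN_PROBES:
        # overlapping linear scan of the code section for the 2-byte value
        hits = sum(1 for i in range(n - 1) if code[i:i + 2] == probe)
        endian.append(hits / n)                          # normalize by code size
    return hist + endian
```

In a real pipeline the `code` bytes would come only from executable ELF sections, as the paper describes; here any byte string can be used for experimentation.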
Table 2. 10-fold stratified cross validation accuracy for various models using the byte-value histogram alone, and the byte-value histogram augmented with heuristic-based endianess attributes.
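The evaluation protocol named in this caption — 10-fold stratified cross validation — can be sketched as follows. The paper's experiments use WEKA; this pure-Python sketch with hypothetical `train_fn`/`predict_fn` callables is only meant to illustrate how stratification keeps every class (including rare ones, such as the few CUDA samples) represented in each test fold.

```python
import random

def stratified_folds(labels, k=10, seed=0):
    """Partition sample indices into k folds, preserving class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for idx, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # deal this class's samples round-robin across the k folds
        for i, idx in enumerate(indices):
            folds[i % k].append(idx)
    return folds

def cross_validate(train_fn, predict_fn, X, y, k=10):
    """Accuracy pooled over all k held-out test folds."""
    correct = total = 0
    for fold in stratified_folds(y, k):
        test = set(fold)
        model = train_fn([x for i, x in enumerate(X) if i not in test],
                         [l for i, l in enumerate(y) if i not in test])
        for i in fold:
            correct += predict_fn(model, X[i]) == y[i]
            total += 1
    return correct / total
```

For example, a trivial majority-class "model" trained on a 30/10 class split scores 0.75 under this protocol, since each fold holds out three majority and one minority sample.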
The byte-value histogram is generated by scanning every sample file for all sections labeled as executable code, and then reading those sections one byte at a time to generate our byte-value histogram. When the entire file has been processed, the histogram values are normalized by dividing each value by the number of bytes of code within that file. These make up the first 256 entries in the feature vector. The four additional endianess values are calculated by a linear scan of each code section for the specific two-byte values. These counts are normalized over the size of the code sections within the file as well. All parts of the file that do not contain object code, as defined by the ELF section's CODE flag (or, in the case of CUDA code, an ELF section named .nv_fatbin), are explicitly excluded from the feature vectors.

In addition to generating samples that use the entire code section within the sample file, we also want to test against object code fragments of varying size. To generate those feature vectors, the same procedure is followed except that the byte values are taken as a random sampling of the code bytes up to the desired size (or the end of the code section). Random sampling removes any bias that may present itself by continuously using the beginning of each code section. For these feature vectors, the endian feature counts are also generated using random 2-byte sampling of N offsets within the code section, where N is the maximum size of the sample. The appropriate feature count is incremented if the random 2-byte sample matches one of the specific 2-byte values we're searching for. These counts are also normalized to the number of code bytes used within the sample.

To test the effectiveness of 2-byte bi-grams, we generate 64k-entry feature vectors for the 'mips' and 'mipsel' classes. We can then compare the results when using this data subset to the overall results using our four endian features.

Results

We used the generated feature vectors to train a set of common multi-class classifiers available in WEKA (Hall et al., 2009). The models chosen are inherently multi-class, with the exception of the SVM (SMO) model, which uses a series of 1-versus-1 comparisons to choose the final class. The results are summarized in Table 2, which shows the 10-fold stratified cross validation accuracy for the chosen classifiers. Of note, the linear-based classifiers (Logistic Regression, SVM) and the Decision Tree seem to have the greatest accuracy, but all classifiers do very well. This clearly shows that there is enough unique information about the architecture exposed within the byte histogram to accurately classify object code in nearly all instances.

Table 3 shows the F-Measure values broken down by class for the Logistic Regression classifier. F-Measure is the harmonic mean of Precision and Recall. Higher F-Measure values indicate better classification performance, and a value of 1.0 would be perfect classification. The chart shows that the majority of the classification errors are caused in the 'mips' and 'mipsel' classes when we do not include our four endianess features and rely solely on the byte histogram. The dramatic improvement in F-Measure with these features shows that they are indeed useful heuristics for determining endianess. Note that CUDA F-Measure scores suffer from the small number of CUDA samples available within the dataset.

These classifiers are mostly trained with their default parameters. One notable exception to this is the Neural Network classifier, which suffers from overfitting when adding the endian features with the default network structure of 260 × 140 × 20. A partial grid search over the number of epochs and the number of hidden nodes suggests that a network configuration of 260 × 66 × 20 with 100 epochs results in performance in line with the other classifiers. See Table 5 for the full breakdown of all parameters used to generate these results. Parameters for each classifier could undoubtedly be tuned further for even greater classification performance.

Finally, Table 4 shows the F-Measure of two models classifying 'mips' versus 'mipsel' using a 64k bi-gram histogram versus our 260-feature byte histogram and endian features. Surprisingly, the bi-gram encoding appears to preserve much less endian information than our simpler heuristic-based method despite the much higher computational overhead of its larger feature vector.

Table 3. Resulting per-class F-Measure for the Logistic Regression model. Note the increase in score for the mips and mipsel targets with the addition of endian features. Other models show a similar pattern.

Architecture    Histogram    Hist + Endian
alpha           0.992        0.997
hppa            0.994        0.993
m68k            0.995        0.993
arm64           0.987        0.994
ppc64           0.995        0.996
sh4             0.993        0.993
sparc64         0.987        0.993
amd64           0.987        0.990
armel           0.998        0.998
armhf           0.994        0.996
i386            0.995        0.998
ia64            0.995        0.995
mips            0.472        0.884
mipsel          0.476        0.886
powerpc         0.990        0.989
s390            0.998        0.998
s390x           0.998        0.998
sparc           0.988        0.988
cuda            0.444        0.516
avr             0.926        0.936

Table 4. Comparison of the F-Measure results when using straight bi-grams versus the four heuristic endianess features proposed in this paper. The performance of the classifier when using the proposed features is significantly better while being less computationally expensive.

Trained model    64k Bi-grams (mips / mipsel)    Hist + Endian (mips / mipsel)
Random Forest    0.530 / 0.453                   0.721 / 0.681
Decision Tree    0.477 / 0.476                   0.897 / 0.897

Table 5. Full parameter list used for training each WEKA model. Deviations from the default values are marked in bold.

Fig. 2. 10-fold cross validation accuracy of the classifiers for different maximum sample sizes. Note that for both SVM and 1-NN, the accuracy approaches 90% with only 16 bytes of sample data. By 8 KB, all classifiers are near or above 90% accuracy.

Sample size

The above results achieve high accuracy using every byte of object code available within each sample. Another question is how large a sample fragment is needed to achieve high accuracy. This is a useful metric for analysts
who often deal with incomplete fragments of samples. To test this, we generate new feature vectors from our samples using maximum sample sizes of four bytes up to one megabyte using the random sample methodology explained earlier. We then ran each of these size-based feature sets through the models trained on the full-sample instances. The results are summarized in Fig. 2. These results show that for both the SVM and 1-NN classifiers, one can achieve very high accuracy even for tiny amounts of sample data, and that by 8 KB, nearly all classifiers are above 90% accuracy.

Discussion and further work

We have shown that machine learning can be an effective tool to classify the target architecture of object code. As this method is independent of potentially misleading meta-data, it provides both a way to verify existing meta-data and a way forward when no meta-data is present. We have developed heuristics that can be used to predict the endianess of code. Of the classifiers tested, SVM and nearest neighbor approaches appear to provide good classification performance regardless of fragment size.

Going forward, we would like to expand our current architecture dataset to include a more varied sampling of architectures. We intend to include more embedded platforms, microcontroller code, and more GPU samples. We will also include samples using different compilers than GCC, including LLVM/Clang and Microsoft Visual Studio, to make sure that different code generation engines do not affect the overall classification performance.

In addition to expanding the dataset, we will continue to explore other areas to apply machine learning to binary object code. Two interesting areas of research include code attribution, and automated reverse engineering techniques such as determining function boundaries. We feel that machine learning could play an important role in advancing these research areas.

Acknowledgments

The authors would like to thank Dr. Tim Oates of UMBC for guidance on machine learning techniques, and Brad Barrett and Charles Lepple of JHU/APL for discussion and insight into previous research in this area. Additionally, we would like to thank the reviewers for their comments and help in preparing this paper for publication.

References

Anh QN. Capstone: next generation disassembly framework. USA: BlackHat; 2014.

Beebe N, Maddox L, Liu L, Sun M. Sceadan: using concatenated N-gram vectors for improved file and data type classification. Inf Forensics Secur IEEE Trans Sept 2013;8(9):1519–30.

Blanco A, Eissler M. One firmware to monitor 'em all. Ekoparty. 2012.

Blem E, Menon J, Sankaralingam K. A detailed analysis of contemporary ARM and x86 architectures. Tech. rep., UW-Madison. 2013.

Chernov A, Troshina K. Reverse engineering of binary programs for custom virtual machines. Recon. 2012.

Delugré G. Closer to metal: reverse engineering the broadcom netextreme's firmware. Presented at Hack.lu. 2010.

Fitzgerald S, Mathews G, Morris C, Zhulyn O. Using NLP techniques for file fragment classification. Digit Investig 2012;9:S44–9.

Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH. The WEKA data mining software: an update. SIGKDD Explor Newsl Nov 2009;11(1):10–8. URL, http://doi.acm.org/10.1145/1656274.1656278.

Heffner C. Binwalk firmware analysis tool. 2010. Accessed 09.04.15. URL, http://binwalk.org/.

Kolter JZ, Maloof MA. Learning to detect and classify malicious executables in the wild. J Mach Learn Res 2006;7:2721–44.

Li Q, Ong AY, Suganthan PN, Thing VL. A novel support vector machine approach to high entropy data fragment classification. In: SAISMC; 2010. p. 236–47.

Li W-J, Wang K, Stolfo S, Herzog B. Fileprints: identifying file types by N-gram analysis. In: Information Assurance Workshop, 2005. IAW '05. Proceedings from the Sixth Annual IEEE SMC; June 2005. p. 64–71.

McDaniel M, Heydari MH. Content based file type detection algorithms. In: System Sciences, 2003. Proceedings of the 36th Annual Hawaii International Conference on. IEEE; 2003. 10 pp.

Miller C. Battery firmware hacking: inside the innards of a smart battery. Tech. rep., Accuvant Labs. 07 2011.

Rad BB, Masrom M, Ibrahim S. Opcodes histogram for classifying metamorphic portable executables malware. In: e-Learning and e-Technologies in Education (ICEEE), 2012 International Conference on. IEEE; 2012. p. 209–13.

Russ F, Muniz S. A python interface to the GNU binary file descriptor (BFD) library. 2013. Accessed 09.04.15. URL, https://github.com/Groundworkstech/pybfd.

Sickendick KA. File carving and malware identification algorithms applied to firmware reverse engineering. Tech. rep., DTIC Document. 2013.

Xie H, Abdullah A, Sulaiman R. Byte frequency analysis descriptor with spatial information for file fragment classification. In: Proceeding of the International Conference on Artificial Intelligence in Computer Science and ICT (AICS 2013); 2013.