Automatic Classification of Object Code Using Machine Learning
By
John Clemens
DFRWS is dedicated to the sharing of knowledge and ideas about digital forensics
research. Ever since it organized the first open workshop devoted to digital forensics
in 2001, DFRWS has continued to bring academics and practitioners together in an
informal environment.
As a non-profit, volunteer organization, DFRWS sponsors technical working groups,
annual conferences and challenges to help drive the direction of research and
development.
http://dfrws.org
Digital Investigation 14 (2015) S156–S162
Digital Investigation
journal homepage: www.elsevier.com/locate/diin
DFRWS 2015 US
Abstract

Keywords: Machine learning; Classification; Computer architecture; Malware analysis; Object code

Recent research has repeatedly shown that machine learning techniques can be applied to either whole files or file fragments to classify them for analysis. We build upon these techniques to show that for samples of un-labeled compiled computer object code, one can apply the same type of analysis to classify important aspects of the code, such as its target architecture and endianess. We show that using simple byte-value histograms we retain enough information about the opcodes within a sample to classify the target architecture with high accuracy, and then discuss heuristic-based features that exploit information within the operands to determine endianess. We introduce a dataset with over 16,000 code samples from 20 architectures and experimentally show that by using our features, classifiers can achieve very high accuracy with relatively small sample sizes.
© 2015 The Authors. Published by Elsevier Ltd on behalf of DFRWS. This is an open access
article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
http://dx.doi.org/10.1016/j.diin.2015.05.007
classifier that relies solely on the object code itself, ignoring any meta-data that may (or may not) be present. Secondly, the analysis from most previous work stops one level above what we believe is possible. These systems will identify a sample as containing object code, but won't give any more information than a general file label. When possible, we should label the sample with information about the type of object code the sample contains.

We propose methods that apply machine learning techniques to automatically classify an object code sample with its target architecture and endianess. Such a system automates the first phase of object code analysis, allowing the analyst to jump directly to decoding the instructions and determining intent.

The rest of this paper is structured as follows: the next sub-section discusses related research. In the Hypothesis section we attempt to formalize the problem of architecture and endianess classification. Next we discuss the intuition behind our proposed solutions, and then go over our experimental design and results. We conclude with a discussion of the results and potential follow-on work.

Related research

Many systems exist to determine the type of binary code a file may contain. The simplest systems rely solely on the file name or file extension. However, most systems rely on the contents of a "file header" at a known location within the file (normally at the beginning) which includes meta-data about what type of file it is, such as a document, picture, or executable. The UNIX file command uses a database of "magic" values at known offsets within the file to classify the file type. In the case of executables or other object code, these file type (ELF, PE, etc.) headers contain fields with information such as the target architecture, word size, and endianess. Each of these systems uses some form of meta-data (file header, signature, or filename) that may not be available to an analyst.

McDaniel and Heydari (2003) were among the first to propose using characteristics derived from the contents of an entire file to do classification. They used byte-value histograms as one of their representations and performed statistical analysis to classify files. This inspired many more researchers to use other methods, including n-gram analysis and SVMs, to tackle the same problem. Examples include Fitzgerald et al. (2012), Li et al. (2010, 2005), and Xie et al. (2013). Beebe et al. (2013) produced the Sceadan tool, which builds upon much of this earlier work. This line of research has concentrated on differentiating diverse file types from each other.

Relating specifically to architecture classification, Chernov and Troshina (2012) attempt to automate the analysis of custom virtual machines used by malware. Their system uses opcode frequency counts as part of their analysis system to help defeat code obfuscation within the custom virtual machine. Similarly, Rad et al. (2012) show that opcode frequency counts can be used to find mutated forms of the same malware. They rely on knowledge of the underlying physical system's opcodes as an indicator of program similarity.

Sickendick (2013) describes a system for firmware disassembly including file carving and architecture detection using machine learning. For architecture detection, he adapts the method of Kolter and Maloof (2006), used for malware detection. The information gain for each byte-value 4-gram in the training set is calculated, and the top 500 4-grams are used as a feature vector for a Decision Tree and an SVM classifier. This work is limited to four architectures common to SCADA devices and makes no attempt to classify different endianess within the same architecture.

Binwalk (Heffner, 2010) is a popular firmware analysis tool that includes two techniques to identify object code. When run with the '-A' option, Binwalk looks for architecture-specific signatures indicative of object code. Currently, Binwalk's architecture signature detection includes 33 signatures from 9 different architectures. However, Binwalk simply reports every place it finds a signature and leaves it up to the user to make a classification decision based upon that information. Binwalk also includes a '-Y' option which will attempt to disassemble code fragments using the Capstone (Anh, 2014) disassembly framework configured for multiple architectures. Binwalk currently supports 9 configurations of 4 unique architectures for disassembly. Notably, both methods can potentially indicate endianess as well as architecture.

Binwalk's methods are effective in a wide variety of use cases, but are not without their limitations. Signature-based methods can lead to false positives if the byte signatures are not unique when compared to other architectures. Evidence of such collisions exists in the Binwalk code itself, where a comment mentions that some 16-bit MIPS code signatures are often detected in ARM Thumb code. Disassembly of a fragment can also cause issues. There is at least one case (i386 versus x86_64) where both architectures could disassemble the same fragment of code without error. Both techniques rely on previous knowledge of the architecture, and in the case of active disassembly, complete knowledge and support in a disassembler framework. The technique presented in this paper takes a more holistic approach, and is able to classify architectures, both virtual and physical, for which there are samples, even if information about the architecture is incomplete.

Problem

We aim to automatically classify two characteristics of computer object code:

• Architecture: The unique encoding of the computer's instructions.
• Endianess: The way the code expects multi-byte data to be ordered when in memory.

Computer object code consists of a stream of machine instructions encoded as a string of bytes. The instruction stream is loaded into memory and stored in the native endianess of the processor. The processor fetches instructions from the instruction stream in memory, and then decodes and executes them. Computers share the same architecture if they use the same (or similar) encodings for
these machine instructions. The encoding of the instructions is referred to as an instruction set. Some architectures define fixed-length instruction encodings while others define variable-length instruction encodings. This makes it impossible to determine the boundaries of instructions within an instruction stream without knowing the target architecture.

Machine instructions consist of two parts: the opcode specifies which instruction the processor is to execute, and the operands specify what data (or pointers to data) the instruction applies to. Opcodes are the byte representation of the instruction and are specified by the architecture. Operands can be many things, including encoded register values, memory locations, and direct data values. While opcode encodings are unique to a specific architecture, operands vary with the data and flow of the particular program. To accurately classify the architecture, one should isolate its opcodes.

Endianess refers to the way the architecture stores multi-byte data in memory. There are two ways multi-byte values may be encoded: least significant byte first (little endian) or most significant byte first (big endian).¹ Most architectures define an endianess, so knowing the architecture automatically infers the endianess. However, some architectures (e.g. MIPS, ARM, Power) can be configured to use either endianess at runtime, and thus a proper classification must also determine the endianess of a sample for those architectures.

Since endianess deals with the layout of data in memory, it is difficult to determine from a sample of object code alone. However, operands may contain immediate values and/or address values which are encoded in the native endianess of the architecture when stored in memory or on disk. Any system that classifies endianess from an instruction stream may be able to extract that information from the portion of the object code used for operands.

Hypothesis

Previous research (McDaniel and Heydari, 2003) has shown that byte-value histograms over an entire file can be useful when classifying a file's type. We propose to apply this same basic technique to the object code embedded within a sample. We deliberately ignore the rest of the file as it may contain meta-data that is either not present or not trustworthy within a given scenario.

Examples from some known architecture encodings give us reason to believe that a byte-value histogram will be useful for classification. The 'amd64' architecture is a 64-bit extension of the 'i386' architecture, and uses a special "prefix" byte for every instruction that uses 64-bit operands. This byte has the high 4-bit nibble set to b'0100' and the lower four bits change depending on the rest of the instruction. One would expect a byte-value histogram for a sample from the amd64 architecture to contain many values that start with '0x4'. ARM instruction encoding specifies that the upper 4 bits of each instruction start with 'condition codes'. For most instructions, these are set to b'1110', which means 'always execute'. Therefore, one would expect a byte-value histogram for ARM systems to contain many values that start with '0xE'. Intuitively, a machine learning algorithm should be able to accurately classify between these two architectures based solely on a byte-value histogram.

More generally, in order for a byte-value histogram to be useful for classifying object code, the uniqueness of the architecture's opcodes must be preserved within the histogram. To demonstrate this is possible, we need an estimation of how likely an opcode is to influence each byte within the code section. We call this the opcode density of the architecture, and it is calculated by the formula:

    Opcode Density = length of opcode / average instruction length

For fixed-length instruction set architectures, the instruction length is fixed (normally 32 or 64 bits depending on the architecture's word size), and the opcode takes up between 6 and 12 bits, depending on the instruction. To use MIPS as an example, the instruction length is 4 bytes, and the opcode is 6 bits long, for an opcode density of approximately 19%. Practically, this means the first byte of every instruction (one in four bytes) will have the opcode encoded in its top 6 bits, heavily influencing its value. Similar analysis can be carried out for the SPARC and Alpha architectures, where the opcode is encoded in 8 bits, and ARM (8-bit opcodes + 4-bit condition codes). Even if we assume that the operands in the object code are random values, one can see that for fixed-length instruction encodings one in four byte-values within the object code will be heavily influenced by the opcode value.

For variable-length instruction sets the analysis is more difficult, as we no longer know the ratio of opcodes to total instruction length. Intel i386 opcodes have a minimum length of one byte (but can be two or more). Blem et al. (2013) show that on average, the i386 architecture for general desktop workloads has an instruction length of 3.4 bytes. This means that even if we assume one-byte opcodes, our opcode density is approximately 30%, or at the very least it is higher than most fixed-length instruction encodings for a typical workload.

These rough calculations give us some confidence that a byte-value histogram can preserve information about the opcode encoding, and thus can be used for architecture classification.

Endianess

Unfortunately, determining endianess is impossible with a byte-value histogram alone. Determining endianess requires byte adjacency information, and adjacency information is lost in the conversion to the histogram. Therefore, in order to determine endianess, we need another set of features that can preserve byte ordering information.
¹ There is also "mixed endian", but that is no longer in wide use and not considered for this analysis.

One approach would be to generate a 2-byte-value (bi-gram) histogram. While this may encode adjacency information, it would explode our feature space from 256 dimensions to 65,536, adding a large amount of
computational complexity. Also, despite the intuition, our experiments show that this approach is not useful for determining endianess.

In the previous analysis we treated the operands for a sample as random noise. While convenient for that analysis, at least some instructions encode 'immediate' data within their operands. These operands are stored in the object code in native-endian format. We aim to exploit this information to determine endianess using a small set of heuristics.

On machines without an increment instruction, one common operation when incrementing by a small value is to use an add instruction with an immediate operand of 1. On big endian machines, one is encoded in 32 bits as 0x00000001, while on little endian machines it is encoded as 0x01000000. This provides us with a heuristic: if we scan the object code for the 2-byte strings '0x0100' and '0x0001', then the latter should occur more often in little endian samples and the former should occur more often in big endian samples. This could be repeated for other small values. Another common type of immediate value encoded in operands is an address. Some addresses, typically for stack values, are high up in the address space and start with values like 0xfffe. Again, these addresses are stored differently on big endian versus little endian machines, and a scan for both values 0xfffe and 0xfeff can be used as another indicator of endianess.

We propose to use these four heuristically derived 2-byte frequency counts ('0xfffe', '0xfeff', '0x0001', '0x0100') as four new "endian" features to augment the byte-value histogram, as shown in Fig. 1. We demonstrate that these features add the ability to predict endianess with minimal computational overhead.

Experiments

…of 8-bit micro-controllers as well as CUDA samples that target the nVidia line of GPUs. All sample files in this data set are ELF files, and object code is identified by using the PyBFD (Russ and Muniz, 2013) library to parse ELF section information.

A summary of the resulting dataset with samples from 20 different architectures is shown in Table 1. Of particular interest to endianess classification is the inclusion of 'mips' and 'mipsel' as two different classes. As both classes use the exact same opcodes, the only difference between the samples is the endianess of values within their operands.

As with all datasets, this one could be improved. All samples except the CUDA samples are compiled with GCC. A different compiler might use a different mix of opcodes and thus have a different signature. Additionally, there are many more 8 and 16-bit architectures than what are represented here. We hope to augment this dataset over time to add more diversity among the samples.

Feature generation

As described above, we will use a feature vector that contains a byte-value histogram of the code section augmented with four additional counts of specific values we will look for to indicate endianess. The layout of the feature vector is shown in Fig. 1.

When preparing the samples, we can choose to have one feature vector per sample file, or we can choose to extract the code from each file into one big pool and draw equal-sized samples from the global pool. The latter approach might be beneficial to avoid an issue where an individual file's code sections are tiny, and thus have mostly zero values in their histograms. However, the approach of one-sample-per-file is a more realistic scenario in the field. For this paper, one feature vector is generated per sample file.
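The 260-entry feature vector described above — a normalized 256-bin byte-value histogram followed by the four normalized endianess probe counts — can be sketched in a few lines of Python. This is an illustrative reconstruction, not the author's tooling; the function name, the overlapping 2-byte scan, and the ordering of the four probe counts are our own assumptions.

```python
from collections import Counter

# The four 2-byte endianess probe values proposed in the paper.
ENDIAN_PROBES = [b"\xff\xfe", b"\xfe\xff", b"\x00\x01", b"\x01\x00"]

def feature_vector(code: bytes) -> list[float]:
    """260 features: 256 normalized byte-value bins + 4 endian counts."""
    n = len(code)
    if n == 0:
        return [0.0] * 260
    counts = Counter(code)                               # byte-value frequencies
    hist = [counts.get(b, 0) / n for b in range(256)]    # normalize by code size
    endian = []
    for probe in ENDIAN_PROBES:
        # overlapping linear scan of the code section for the 2-byte value
        hits = sum(1 for i in range(n - 1) if code[i:i + 2] == probe)
        endian.append(hits / n)                          # normalize by code size
    return hist + endian
```

In a real pipeline the `code` bytes would come only from executable ELF sections, as the paper describes; here any byte string can be used for experimentation.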
Table 2. 10-fold stratified cross validation accuracy for various models using the byte-value histogram alone, and the byte-value histogram augmented with heuristic-based endianess attributes.
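The evaluation protocol named in this caption — 10-fold stratified cross validation — can be sketched as follows. The paper's experiments use WEKA; this pure-Python sketch with hypothetical `train_fn`/`predict_fn` callables is only meant to illustrate how stratification keeps every class (including rare ones, such as the few CUDA samples) represented in each test fold.

```python
import random

def stratified_folds(labels, k=10, seed=0):
    """Partition sample indices into k folds, preserving class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for idx, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # deal this class's samples round-robin across the k folds
        for i, idx in enumerate(indices):
            folds[i % k].append(idx)
    return folds

def cross_validate(train_fn, predict_fn, X, y, k=10):
    """Accuracy pooled over all k held-out test folds."""
    correct = total = 0
    for fold in stratified_folds(y, k):
        test = set(fold)
        model = train_fn([x for i, x in enumerate(X) if i not in test],
                         [l for i, l in enumerate(y) if i not in test])
        for i in fold:
            correct += predict_fn(model, X[i]) == y[i]
            total += 1
    return correct / total
```

For example, a trivial majority-class "model" trained on a 30/10 class split scores 0.75 under this protocol, since each fold holds out three majority and one minority sample.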
The byte-value histogram is generated by scanning every sample file for all sections labeled as executable code, and then reading those sections one byte at a time to generate our byte-value histogram. When the entire file has been processed, the histogram values are normalized by dividing each value by the number of bytes of code within that file. These make up the first 256 entries in the feature vector. The four additional endianess values are calculated by a linear scan of each code section for the specific two-byte values. These counts are normalized over the size of the code sections within the file as well. All parts of the file that do not contain object code, as defined by the ELF section's CODE flag (or, in the case of CUDA code, an ELF section named .nv_fatbin), are explicitly excluded from the feature vectors.

In addition to generating samples that use the entire code section within the sample file, we also want to test against object code fragments of varying size. To generate those feature vectors, the same procedure is followed except that the byte values are taken as a random sampling of the code bytes up to the desired size (or the end of the code section). Random sampling removes any bias that may present itself by continuously using the beginning of each code section. For these feature vectors, the endian feature counts are also generated using random 2-byte sampling of N offsets within the code section, where N is the maximum size of the sample. The appropriate feature count is incremented if the random 2-byte sample matches one of the specific 2-byte values we're searching for. These counts are also normalized to the number of code bytes used within the sample.

To test the effectiveness of 2-byte bi-grams, we generate 64k-entry feature vectors for the 'mips' and 'mipsel' classes. We can then compare the results when using this data subset to the overall results using our four endian features.

Results

We used the generated feature vectors to train a set of common multi-class classifiers available in WEKA (Hall et al., 2009). The models chosen are inherently multi-class, with the exception of the SVM (SMO) model, which uses a series of 1-versus-1 comparisons to choose the final class. The results are summarized in Table 2, which shows the 10-fold stratified cross validation accuracy for the chosen classifiers. Of note, the linear-based classifiers (Logistic Regression, SVM) and the Decision Tree seem to have the greatest accuracy, but all classifiers do very well. This clearly shows that there is enough unique information about the architecture exposed within the byte histogram to accurately classify object code in nearly all instances.

Table 3 shows the F-Measure values broken down by class for the Logistic Regression classifier. F-Measure is the harmonic mean of Precision and Recall. Higher F-Measure values indicate better classification performance, and a value of 1.0 would be perfect classification. The chart shows that the majority of the classification errors are caused in the 'mips' and 'mipsel' classes when we do not include our four endianess features and rely solely on the byte histogram. The dramatic improvement in F-Measure with these features shows that they are indeed useful heuristics for determining endianess. Note that CUDA F-Measure scores suffer from the small number of CUDA samples available within the dataset.

These classifiers are mostly trained with their default parameters. One notable exception to this is the Neural Network classifier, which suffers from overfitting when adding the endian features with the default network structure of 260 × 140 × 20. A partial grid search over the number of epochs and the number of hidden nodes suggests that a network configuration of 260 × 66 × 20 with 100 epochs results in performance in line with the other classifiers. See Table 5 for the full breakdown of all parameters used to generate these results. Parameters for each classifier could undoubtedly be tuned further for even greater classification performance.

Finally, Table 4 shows the F-Measure of two models classifying 'mips' versus 'mipsel' using a 64k bi-gram histogram versus our 260-feature byte histogram and endian features. Surprisingly, the bi-gram encoding appears to preserve much less endian information than our simpler heuristic-based method despite the much higher computational overhead of its larger feature vector.

Table 3. Resulting per-class F-Measure for the Logistic Regression model. Note the increase in score for the mips and mipsel targets with the addition of endian features. Other models show a similar pattern.

Architecture    Histogram    Hist + Endian
alpha           0.992        0.997
hppa            0.994        0.993
m68k            0.995        0.993
arm64           0.987        0.994
ppc64           0.995        0.996
sh4             0.993        0.993
sparc64         0.987        0.993
amd64           0.987        0.990
armel           0.998        0.998
armhf           0.994        0.996
i386            0.995        0.998
ia64            0.995        0.995
mips            0.472        0.884
mipsel          0.476        0.886
powerpc         0.990        0.989
s390            0.998        0.998
s390x           0.998        0.998
sparc           0.988        0.988
cuda            0.444        0.516
avr             0.926        0.936

Table 4. Comparison of the F-Measure results when using straight bi-grams versus the four heuristic endianess features proposed in this paper. The performance of the classifier when using the proposed features is significantly better while being less computationally expensive.

Trained model    64k Bi-grams (mips / mipsel)    Hist + Endian (mips / mipsel)
Random Forest    0.530 / 0.453                   0.721 / 0.681
Decision Tree    0.477 / 0.476                   0.897 / 0.897

Table 5. Full parameter list used for training each WEKA model. Deviations from the default values are marked in bold.

Fig. 2. 10-fold cross validation accuracy of the classifiers for different maximum sample sizes. Note that for both SVM and 1-NN, the accuracy approaches 90% with only 16 bytes of sample data. By 8 KB, all classifiers are near or above 90% accuracy.

Sample size

The above results achieve high accuracy using every byte of object code available within each sample. Another question is how large a sample fragment is needed to achieve high accuracy. This is a useful metric for analysts
who often deal with incomplete fragments of samples. To test this, we generate new feature vectors from our samples using maximum sample sizes of four bytes up to one megabyte using the random sample methodology explained earlier. We then ran each of these size-based feature sets through the models trained on the full-sample instances. The results are summarized in Fig. 2. These results show that for both the SVM and 1-NN classifiers, one can achieve very high accuracy even for tiny amounts of sample data, and that by 8 KB, nearly all classifiers are above 90% accuracy.

Discussion and further work

We have shown that machine learning can be an effective tool to classify the target architecture of object code. As this method is independent of potentially misleading meta-data, it provides both a way to verify existing meta-data and a way forward when no meta-data is present. We have developed heuristics that can be used to predict the endianess of code. Of the classifiers tested, SVM and nearest neighbor approaches appear to provide good classification performance regardless of fragment size.

Going forward, we would like to expand our current architecture dataset to include a more varied sampling of architectures. We intend to include more embedded platforms, microcontroller code, and more GPU samples. We will also include samples using different compilers than GCC, including LLVM/Clang and Microsoft Visual Studio, to make sure that different code generation engines do not affect the overall classification performance.

In addition to expanding the dataset, we will continue to explore other areas to apply machine learning to binary object code. Two interesting areas of research include code attribution, and automated reverse engineering techniques such as determining function boundaries. We feel that machine learning could play an important role in advancing these research areas.

Acknowledgments

The authors would like to thank Dr. Tim Oates of UMBC for guidance on machine learning techniques, and Brad Barrett and Charles Lepple of JHU/APL for discussion and insight into previous research in this area. Additionally, we would like to thank the reviewers for their comments and help in preparing this paper for publication.

References

Anh QN. Capstone: next generation disassembly framework. USA: BlackHat; 2014.

Beebe N, Maddox L, Liu L, Sun M. Sceadan: using concatenated N-gram vectors for improved file and data type classification. Inf Forensics Secur IEEE Trans Sept 2013;8(9):1519–30.

Blanco A, Eissler M. One firmware to monitor 'em all. Ekoparty. 2012.

Blem E, Menon J, Sankaralingam K. A detailed analysis of contemporary ARM and x86 architectures. Tech. rep., UW-Madison. 2013.

Chernov A, Troshina K. Reverse engineering of binary programs for custom virtual machines. Recon. 2012.

Delugré G. Closer to metal: reverse engineering the broadcom netextreme's firmware. Presented at Hack.lu. 2010.

Fitzgerald S, Mathews G, Morris C, Zhulyn O. Using NLP techniques for file fragment classification. Digit Investig 2012;9:S44–9.

Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH. The WEKA data mining software: an update. SIGKDD Explor Newsl Nov 2009;11(1):10–8. URL, http://doi.acm.org/10.1145/1656274.1656278.

Heffner C. Binwalk firmware analysis tool. 2010. Accessed 09.04.15. URL, http://binwalk.org/.

Kolter JZ, Maloof MA. Learning to detect and classify malicious executables in the wild. J Mach Learn Res 2006;7:2721–44.

Li Q, Ong AY, Suganthan PN, Thing VL. A novel support vector machine approach to high entropy data fragment classification. In: SAISMC; 2010. p. 236–47.

Li W-J, Wang K, Stolfo S, Herzog B. Fileprints: identifying file types by N-gram analysis. In: Information Assurance Workshop, 2005. IAW '05. Proceedings from the Sixth Annual IEEE SMC; June 2005. p. 64–71.

McDaniel M, Heydari MH. Content based file type detection algorithms. In: System Sciences, 2003. Proceedings of the 36th Annual Hawaii International Conference on. IEEE; 2003. 10 pp.

Miller C. Battery firmware hacking: inside the innards of a smart battery. Tech. rep., Accuvant Labs. 07 2011.

Rad BB, Masrom M, Ibrahim S. Opcodes histogram for classifying metamorphic portable executables malware. In: e-Learning and e-Technologies in Education (ICEEE), 2012 International Conference on. IEEE; 2012. p. 209–13.

Russ F, Muniz S. A python interface to the GNU binary file descriptor (BFD) library. 2013. Accessed 09.04.15. URL, https://github.com/Groundworkstech/pybfd.

Sickendick KA. File carving and malware identification algorithms applied to firmware reverse engineering. Tech. rep., DTIC Document. 2013.

Xie H, Abdullah A, Sulaiman R. Byte frequency analysis descriptor with spatial information for file fragment classification. In: Proceeding of the International Conference on Artificial Intelligence in Computer Science and ICT (AICS 2013); 2013.