
01_03_calculating-entropy-with-python.en

This video discusses the importance of entropy in traffic analysis for command and control data transmission, focusing on how to identify suitable fields for data storage. The fieldEntropy function is introduced, which calculates the entropy of data fields, specifically targeting string, bytes, and bytearray types while ensuring a minimum length for effective data transmission. The video demonstrates the calculation of entropy using both structured English text and random bytes, highlighting the differences in entropy values and their implications for data obfuscation.

Uploaded by

rasha.ziad.share

Hello and welcome back to this course. In the previous video, we talked about performing some traffic
analysis
for command and control. Our goal there was to look at the various protocols,
types of traffic, packets and packet fields, and try to identify a good option for
command and control. We want something
that's rather common, has some space for storing
command and control data, and where the data that we're going to transmit
doesn't stick out. That third objective,
not sticking out is what we're going to
be talking about a little bit more in this video. In that video, we talked
briefly about entropy, which is a measurement of
the randomness in data or the amount of information a particular piece
of data can encode. For what we're doing, we want the ability to have some
relatively high
entropy values. If you have data
that's relatively uniform, or it all looks the same, things like English text, it
has relatively
low entropy. If you're used to seeing low-entropy data like
English text in a field, and suddenly you have
something that's got a bit more randomness to
it, like computer code, then that computer
code might stand out a little bit in
that particular field. As part of our traffic analysis, we looked at the entropy of
various fields using a
function called fieldEntropy. Today in this video, we're going to be
looking at what that fieldEntropy
function actually does. Let's take a look at it now. Our fieldEntropy function is
designed to calculate
the entropy of a field. In the previous video, we briefly mentioned that
we put some constraints on that field that we're going
to calculate entropy for. The first is that
we were looking for fields of certain types. When we're trying
to transmit data over command and
control infrastructure, we need some space to
actually send more than a single bit or a small
value in each packet. The less data we can
fit into each packet, the more packets
we have to send, and the more noticeable it is. One of our constraints
is that we want fields that store
certain types of data. Strings, bytes and
bytearrays are three different
Python data types that essentially come down
to "here's a set of bytes." They might be
interpreted differently, but it means that we might be able to store multiple bytes
of information
within a single field. This section of the
code here is testing if we have a field of
type string, bytes, or bytearray, and if so we want to convert
that to a bytearray, because a bytearray in Python
is essentially just a list of bytes, as opposed to
something like a string, which is interpreted
as text. With our list of
individual bytes, we can treat each byte as an independent occurrence and
calculate the entropy on it. Our other constraint
that we talked about is we wanted
a minimum length. If we can only store a single byte of
information in each packet, we're going to need
a lot of packets to achieve our goals. In this case, we've set our
minimum length to five, but we could use something
larger if we chose. If our length is greater
than the minimum length, then we're going to calculate the entropy of that
bytearray. To do so, we're going
to use a couple of Python libraries called
Pandas and SciPy. Pandas gives us the ability
to represent data as a series and then
count the number of occurrences of each
item in that series. We'll take our
bytearray of data, which might be a string, it might just be an
array of bytes, etc. We're going to store it
in a series data type. Then we can call
value counts to say, how many bytes with
value zero do we have? How many bytes with value
one do we have, etc. In the end we should have an array which says for
this particular value, we had this many occurrences. These counts aren't
quite probabilities, but they're pretty close: dividing each count by the total number of bytes gives the observed probability. When we're calculating entropy, we
use observed probabilities, and we might compare those
to expected probabilities. For example, if you're
flipping a coin, the probability should
be 50 percent that it is heads and 50 percent that it is
tails for every coin flip. That's the ideal case. Some people can force a coin to
flip a little
bit more heads than tails, etc., based on a few
different factors. But in theory, half heads, half tails. That's the theory. In
practice, if you
flip a coin 100 times, the odds aren't great that you'd get exactly 50
heads and 50 tails. You might have a 51, 49 split
or something like that. Those are our observed
probabilities in this particular case. When we're talking about the entropy of our
packet fields, we don't have our
expected probabilities. We could say that
there's a one in 256 chance of each
byte occurrence, and that's probably true. But what we do have is
observed probabilities. We say, out of the 200
bytes in our byte array, five of them are the letter A. That observed probability
is
5/200, or 1/40th, and that's our observed
probability of A versus our theoretical probability
of 1 out of 256. Based off of all of these
observed probabilities, we can calculate the entropy
or the amount of randomness or amount of
information that can be encoded in our byte array. Ideally, we should have observed
probabilities
pretty close to 1/256 for every single
potential byte value. That would give us a nice high entropy and
tell us that we can put any type of obfuscated
data in our field. However, if we have something
like English text or Spanish text or Latin text
or pick a language text, we have a much lower
entropy because there's a lot of structure to languages, so the amount of
randomness
in a string is much lower. If I say have the phrase
H-E-L-L-O W-O-R-L in English, you know what the next
letter is going to be. There's no randomness to it. Therefore, there's less
information encoded in that string than if we have something that's
completely random. To demonstrate this, I've
got a few lines here that we're going to use our
field entropy function to calculate the randomness of. We're going to look at
the string "Hello world." Then we're going to look
at an equal length string, 12 bytes, but they're
going to be random. We'll import from
the random library the randbytes
function to do that. Same calculation in both cases. One of them's English text,
one of them's completely
random bytes. If we run our entropy function
here, give it a moment, we see that the entropy
of our random bytes is higher than the entropy
of our English text. If we run this repeatedly (I got lucky in this case, because
this is what I wanted
to demonstrate), in most cases our entropies are going to be the same
from run to run for both of these, because hello
world doesn't change and this is the entropy of our 12 random bytes
where they're all unique. However, down here we get
a different entropy and the reason why is in
this random string, there are repeated
bytes somewhere. Here we go. We see
an \xfc and an \xfc. Because there's a
repetition there, our probabilities are
slightly off of the 1 out of 256 that we're expecting and our entropy goes
down a little bit. We can run this repeatedly as many
times as we want and we see here that our
entropies are the same in the first and the third, meaning that we
got one occurrence of each byte that we see, which is about what we'd expect. This
demonstrates just
our Entropy.py function, which is just designed
to determine if a particular field
meets some criteria, the right types of value stored
in it, and enough length. Then also to calculate the entropy of the
data in that field to see what we might be able
to store in it. Thank you.
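
The fieldEntropy function described in this video could be sketched roughly as follows. This is not the course's actual code: the video uses Pandas and SciPy, while this sketch swaps in collections.Counter and math.log2 from the standard library so it runs without extra packages. The name field_entropy, the UTF-8 encoding of strings, the None return value for unsuitable fields, and the choice of base-2 logarithms (entropy in bits) are all assumptions made for illustration.

```python
import math
from collections import Counter

MIN_LENGTH = 5  # minimum field length mentioned in the video


def field_entropy(field, min_length=MIN_LENGTH):
    """Return the Shannon entropy of a field in bits per byte.

    Returns None when the field is the wrong type or too short.
    (Returning None is an assumption; the video does not say what
    the original function does in that case.)
    """
    # Type constraint: only str, bytes, and bytearray are considered.
    if isinstance(field, str):
        data = bytearray(field, "utf-8")  # encoding is an assumption
    elif isinstance(field, (bytes, bytearray)):
        data = bytearray(field)
    else:
        return None

    # Length constraint: the field must be longer than the minimum.
    if len(data) <= min_length:
        return None

    # Count occurrences of each byte value (the role value counts
    # plays in the video), then turn counts into observed probabilities.
    counts = Counter(data)
    total = len(data)
    probabilities = [count / total for count in counts.values()]

    # Shannon entropy: H = -sum(p * log2(p))
    return -sum(p * math.log2(p) for p in probabilities)


print(field_entropy("Hello world."))    # structured English text
print(field_entropy(bytes(range(12))))  # 12 bytes, all different
print(field_entropy(42))                # wrong type, so None
```

A field made of one repeated byte scores zero, and a field where every byte is different scores the maximum possible for its length.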

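The comparison from the demo, English text against 12 random bytes, can be reproduced with a small stand-alone sketch. The helper name entropy_bits is hypothetical, base-2 logarithms are an assumption, and random.randbytes requires Python 3.9 or later. One thing the sketch makes visible: with only 12 bytes, the highest value the calculation can reach is log2(12), about 3.58 bits, which happens when every byte is unique. A single repeated byte, like the \xfc pair seen in the video, pulls the result below that.

```python
import math
from collections import Counter
from random import randbytes  # available in Python 3.9+


def entropy_bits(data: bytes) -> float:
    """Shannon entropy of the observed byte frequencies, in bits."""
    total = len(data)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(data).values()
    )


text = b"Hello world."                     # 12 bytes of English text
unique = bytes(range(12))                  # 12 bytes, no repeats
repeated = b"\xfc\xfc" + bytes(range(10))  # 12 bytes, one value twice
sample = randbytes(12)                     # changes on every run

for label, data in [
    ("English text", text),
    ("12 unique bytes", unique),
    ("one repeated byte", repeated),
    ("random sample", sample),
]:
    print(f"{label:<18} {entropy_bits(data):.4f}")

print(f"{'upper bound':<18} {math.log2(12):.4f}")
```

Running it several times shows the first three values staying fixed while the random sample usually matches the upper bound and occasionally dips when a byte value repeats.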