This video discusses encoding schemes, particularly URL encoding and Base64 encoding, which are used to transmit data that doesn't conform to specific protocol rules. It explains how these encoding methods can be utilized for obfuscation in cybersecurity, making it harder for unauthorized users to identify sensitive information in network traffic. The video also introduces a helper function to check if data is likely encoded, demonstrating the process with examples and potential pitfalls in identifying encoded data.
01_04_detecting-encodings-with-python.en
Hello and welcome back to this course.
In the past few videos, we've been talking about identifying a good network protocol, and fields within its packets, for command and control. In the first of the three videos, we talked about the code that we use to accomplish this. In the previous video, we talked about entropy, one of our measures of suitability, and now, in this video, we're going to talk about encoding schemes.
Encoding schemes were originally designed to allow data that doesn't follow the rules of a particular protocol to be transmitted over that protocol. In some cases, we have protocols that can only carry printable data. So if you want to send unprintable characters over that protocol, you need to convert them to printable ones. Another case is where protocols have reserved or special characters. For example, in a URL, a question mark is a reserved character, so if you want to use a question mark somewhere in the URL and don't want it interpreted as that reserved character, you need to encode it. In a moment, we'll talk about URL encoding, or percent encoding, which is designed to do exactly that.
These are the original purposes for various encoding schemes. However, they are also commonly applied, especially in offensive cybersecurity, for obfuscation. For example, if you're sending a username and a password or other sensitive data over the network, it's easy for anyone monitoring that traffic to do a keyword search for username or password, find the packet they want, and see that data. However, if that username or password is encoded, then that keyword search won't match unless you know to reverse the encoding.
We're talking about encoding schemes here because, if we're going to use a network protocol for command and control and put our data in a particular field, we might want to have the option to encode that data.
And if so, it would be useful, if possible, to choose a field where encoded data is not unusual. In this video, we're going to talk about two encoding schemes: URL encoding and Base64 encoding.
Our main helper function here is called check encoding. We give it some data, and it tells us whether or not that data is likely to be encoded. Our first test is that if the length of the data is zero, we return false, because zero-length data can be successfully decoded by any scheme, so reporting it would be confusing. If we have non-zero-length data, we check for URL encoding and Base64 encoding. If we find that it matches our rules for those, then we return either URL or Base64, respectively. That goes back to the traffic analyzer script we looked at a couple of videos ago, which includes that information in its output, as we saw.
So let's talk about URL encoding first. With URL encoding, we're going to focus on things that are completely encoded. Often in a URL, the only characters that are encoded are the ones that break the rules, the reserved characters. So you might have something that's mostly printable, with the occasional encoded character. We certainly could use that approach for command and control by randomly encoding characters to break up text matching, and we could easily modify this code to look for those opportunities. However, in this case, we're just going to look for something that's completely URL encoded.
URL encoding gets its other name, percent encoding, from how it encodes data. Each character that's encoded in the string is written as a percent sign followed by the hexadecimal representation of the corresponding ASCII character. For example, a space, which has a hex value of 20, would be represented as %20 in percent encoding.
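To make the difference between partial and complete percent encoding concrete, here is a minimal sketch. The helper name `percent_encode_all` is made up for illustration and is not part of the course code; `quote` and `unquote` are the standard `urllib.parse` functions.

```python
from urllib.parse import quote, unquote


def percent_encode_all(text: str) -> str:
    """Hypothetical helper: write every byte of the text as %XX."""
    return "".join(f"%{byte:02X}" for byte in text.encode("utf-8"))


message = "hello world?"

# Typical URL encoding: only characters that break the rules are encoded.
partial = quote(message)

# Complete encoding: every character becomes a percent sign plus two hex digits.
full = percent_encode_all(message)

print(partial)        # hello%20world%3F
print(full)
print(unquote(full))  # hello world?
```

A keyword search for `hello` would still hit the partially encoded form, but not the completely encoded one, which is the obfuscation property discussed earlier.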
For our check URL encoding function here, we look for things that match a rule that says it should be a percent sign followed by two hexadecimal digits, potentially followed by more of the same percent, hex, hex pattern. We use Python's re library to do that, because it lets us match the string using regular expressions, and here is the regular expression that we'll be using.
Starting in the middle, we've got the percent sign that we want to match, and then a section in square brackets. Square brackets mean any of these, so this section says that if the character is a digit 0-9, a capital A through F, or a lowercase a through f, then match it, because those are the allowable hex digits. After that, we have a two in curly braces, which means match exactly two of whatever precedes it. So we have a percent sign, then something that matches a hex character, and we want two of those, which would match something like %20, our URL encoding for a space. All of this is wrapped in a set of parentheses, saying treat this as one unit, so we only match percent, hex, hex, percent, hex, hex, and so on. Then we want one or more of those units, so if we can't match at least one, the match fails and we return false.
Then we pass in our data. If the entire string matches this pattern, so it's percent, hex, hex, percent, hex, hex, et cetera, we return true, saying yes, it is URL encoded. Otherwise, we return false, saying it doesn't match our rules. It's entirely possible that a field uses URL encoding but only encodes some characters, the ones that are reserved. Because we're using a full match, we won't match those, but we could modify this to allow partial URL encoding if we chose.
The other, more difficult one that we want to test for is Base64 encoding.
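Before turning to Base64, the URL check described above can be sketched roughly as follows. The exact code shown in the video is not reproduced in this transcript, so the function name `check_url_encoding` and the constant `URL_ENCODED` are assumptions; the pattern itself follows the spoken description (a percent sign, exactly two hex digits, grouped, one or more times, matched against the whole string).

```python
import re

# One unit is a percent sign plus exactly two hex digits; require one or more units.
URL_ENCODED = re.compile(r"(%[0-9A-Fa-f]{2})+")


def check_url_encoding(data: str) -> bool:
    """Return True only if the entire string is %XX%XX... (assumed name)."""
    return URL_ENCODED.fullmatch(data) is not None


print(check_url_encoding("%48%65%6C%6C%6F"))  # True: completely encoded
print(check_url_encoding("hello%20world"))    # False: only partially encoded
print(check_url_encoding("%2"))               # False: incomplete escape
```

Swapping `fullmatch` for `search` would be one way to accept partially encoded fields, at the cost of matching any value that happens to contain a single escape.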
Base64 encoding gets its name from the fact that it uses an alphabet of 64 characters to encode data. Those are the alphanumeric characters, so capital A to Z, lowercase a to z, and 0-9, plus a couple of special characters. If you add that up, 26 letters, doubled, plus 10 for 0-9, plus 2, you get 64.
The simple way to test for Base64 encoding is to try to decode it and see if it fails. Python has a base64 library from which we can import b64decode. If we call b64decode on the data and it decodes to a plaintext, we return true, meaning that it could be Base64 encoded. If something goes wrong, that means it wasn't a valid Base64 encoding, so we return false.
As we're going to see when we run our main function in a moment, this is a bit of a shaky way of testing. The reason is that we don't know the data that's stored within the Base64-encoded data. All we're testing is whether it decodes as Base64, which essentially just means that its length is a multiple of four characters and that it's limited to the 64-character alphabet I just mentioned, optionally ending with one or two equals signs, which are used for padding in Base64.
Down here in our main function, we have three messages that we're going to check. We'll use Hello World, Base64 encoded; we'll use a URL-encoded string, so you see the percent, hex, hex, et cetera; and then we'll use the string FFFFFFFF, so eight Fs. For each of these, we call check encoding and print out the results.
So now I'll run python CheckEncoding.py, hit Enter, and we see our three results. Hello World Base64 encodes to this string here, and on testing that, it determines that yes, it does successfully decode to something.
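As a rough sketch of the decode-and-see test described above, assuming a function named `check_b64_encoding` (the name and the use of `validate=True` are assumptions, not taken from the video):

```python
import base64
import binascii


def check_b64_encoding(data: str) -> bool:
    """Return True if the data decodes as Base64 (assumed name)."""
    try:
        # validate=True rejects characters outside the Base64 alphabet
        # instead of silently discarding them (an assumption of this sketch).
        base64.b64decode(data, validate=True)
    except (binascii.Error, ValueError):
        return False
    return True


hello = base64.b64encode(b"Hello World").decode("ascii")

print(hello)                                # SGVsbG8gV29ybGQ=
print(check_b64_encoding(hello))            # True
print(check_b64_encoding("%48%65%6C%6C"))   # False: % is not in the alphabet
print(check_b64_encoding("FFFFFFFF"))       # True, although probably not real Base64
```

The last call shows why this test is shaky: any string built from the 64-character alphabet with a length that is a multiple of four decodes without error, whether or not it was ever meant to be Base64.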
When we use our regular expression to test for URL encoding, this matches because we have our percent sign, two valid hex characters, percent, two valid hex characters, et cetera. Those are both good, because they mean that in the positive case, we successfully identify something that's Base64 encoded and something that's URL encoded.
However, we also get some false positives. FFFFFFFF, F eight times, is technically a valid Base64-encoded string. However, it's not a particularly useful one if you decode it to the plaintext, because it's just the same thing repeated, so it probably wasn't actually intended to be Base64 encoded. It's probably padding or something else. Still, we match it as a valid Base64-encoded string because it is decodable.
Without knowledge of the plaintext that goes into it and what's considered a valid plaintext, which we don't necessarily have when we're analyzing packet fields, we can't be 100% certain whether a successful decode means that the field carries encoded data or just happens to be decodable. But identifying fields where, say, the majority of values are decodable indicates that we might have a field that actually is encoded, which would be useful for command and control.
Again, this is just one of the two helper functions that we're looking at in relation to the traffic analyzer script from a couple of videos ago, for identifying fields in network packets that might be useful for command and control infrastructure. Thank you.