ELK 2 4 - Logstash Filtering - Unstructured Data

This document discusses parsing unstructured log data using Logstash and the Grok filter plugin. It provides an example of an unstructured squid proxy log and demonstrates how to define patterns to extract fields like timestamps, IPs, URLs and other values into a structured format for analysis.

In this lecture we're going to continue our discussion on filtering in Logstash.

However, this one's going to be a little bit more challenging. In the last lecture
we looked at structured data. It was a very simply defined CSV file, just a few
fields, and they were already mostly in the form that we wanted them in, very
straightforward.

In this lecture we're going to focus on unstructured data, and you're going to be
running into a lot of this over the course of your DFIR and NSM work. When I say
unstructured data I mean data that's in a form that is not so easy to parse out, not
something that we can super easily just feed to a CSV plugin and tell it what the
column headings are.

In fact we actually have to build specialized patterns using what's really kind of an
abstraction of regular expressions and actually maybe in some cases use some
regular expressions to parse these pieces of data out.

Unfortunately, doing this type of parsing of unstructured data is often where
you're going to have the most headaches, and where you're going to spend most of
your time getting things indexed into your ELK stack the way you want them.

That's just kind of the name of the game; however, the more of this you do, the
better at it you're going to get. And you'll eventually build libraries of these
things. Fortunately, some other smart folks have built libraries of these out there
that you can use yourself and pull upon in your work.

No sense in reinventing the wheel when you don't have to, but here we're going
to learn how to parse this unstructured data in a meaningful way, and we're
going to do that similarly to how we did the last exercise: I'm going to show
you some real world data and we're going to parse it in our lab system.

Ideally you're going to follow along with this. I'll provide all the sample data and
configuration files you need, so sit back, watch me do this, and then try it
yourself afterwards.
In this example the unstructured data we've been handed is actually a proxy log;
you can see I have it here in ‘var/varlog/proxy.log’. This is an example of a Squid
proxy log, a very common, free proxy that a lot of places use. So in this case
I've got the log, and I've got an example of the output here, so you can kind of
get a sense of what's going on.

Here at the beginning we have a Unix timestamp, then a number field which is
actually the duration the request took, then a source IP address, then the status
that comes from the proxy along with the HTTP response code, the number of bytes
transmitted, the HTTP request method, the actual domain and URI jammed together,
some information included with the destination IP address, and then the content
type of the content that was downloaded; here it was ‘text/plain’.
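To make that shape concrete, a line from a log like this looks roughly as follows. This is a hypothetical line with made-up values, not one of the actual sample records; it's just to show how the fields sit next to each other:

    1389793045.123   757   192.168.1.14   TCP_MISS/200   507   GET   http://www.example.com/index.html   DIRECT/198.51.100.10   text/plain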

So we have a few of those logs on the screen I’ll leave those on here so you have
those as we're going through this but also I’ve provided the sample data to you
that you can look at directly as well.

This gets a little jumbled here because some of these are longer so the second
record takes up multiple lines we see that down here as well.

But you should get the idea, or at least a sense of what's going on. Now for our
configuration file, we're going to start similarly to how we started with all the
others. This is exactly the same as the previous configuration file; again we're
using the sincedb_path option to make sure that we can reindex this without much
trouble.
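As a reminder, that starting point looks roughly like this. This is a sketch; the exact log path and sincedb location are assumptions based on the lab setup rather than the file shown on screen:

    input {
      file {
        path => "/var/log/proxy.log"      # assumed location of the sample Squid log
        start_position => "beginning"
        sincedb_path => "/dev/null"       # lets us reindex the same file from scratch on every run
      }
    }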

Now if I pivot over here to Kibana, we can look at the data as it's indexed using
that configuration file. Again, very straightforward; it's pretty minimal. We have
the message field, and the entirety of each line is parsed into that message field,
so we can't do anything with it. We can't do any aggregations, we can't do any
manipulation on it; we have to break it up, and this is the part that gets a little
tricky. This is where we're going to take our unstructured data and parse it into
structured data, into fields that we can actually do something with and manipulate
a little further.

So we're going to spend some time focused on how to parse that data into those
fields and then we'll also afterwards play around with a few additional filter
plugins that maybe we haven't looked at thus far.

To do the parsing we're going to use a filter plug-in that you're either going to
come to love or hate, and that is the Grok filter plug-in. Grok, very simply, as the
documentation says, parses arbitrary text and structures it. It's really, as the docs
put it (and they don't mince words here), the best way in Logstash to parse crappy
unstructured log data into something structured and queryable.

So that's exactly what we're going to do here: we're going to use the Grok filter
plugin. The documentation on this is great and I certainly recommend reading through
it, but I'm going to give you the crash course here, and then we're going to use it
to filter and parse the unstructured data in this example.

Now, the key thing to understand with Grok is that it's really just kind of an
abstraction over regular expressions. We all roughly know what regular expressions
are: they're a way of matching data based upon a very specific syntax. At some point,
if you haven't already, you will have to use regular expressions in your infosec
career, whether that's performing searches or writing IDS signatures and so on.

Now, when I say it's an abstraction built on regular expressions, I say that
because, thanks to Grok, you don't always have to write regular expressions yourself.
That's because Grok has a series of predefined patterns that are basically aliases,
and I've got a list of those here.

We're looking at the Logstash GitHub repo, where we have these aliases, so to speak.

So for instance, at the top here we see one called USERNAME, and by just saying
USERNAME we're actually saying: hey, use this regular expression! Or by saying
BASE10NUM we're using this other regular expression.

So we're able to use these shortcuts so we don't have to rewrite all these things
and there are quite a few of them here. It's not all encompassing but there are a
lot of them. So it allows us to use this syntax.

Of course, we can define our own regular expressions if we want to; there's a
special syntax used for that. We can define these to our heart's content, but if we
don't have to, we don't want to, because it's no fun.

So we're going to try to use these as best we can in specifying our filters. So let's
talk about how we apply these patterns using Grok. With Grok what we're trying
to do is match individual fields and we want to match them in a row to build our
pattern.

So we want to match the very first field in the log file using one of these Grok
patterns that we defined and map it to a specific field.

So the syntax is this: a percent sign, then an opening curly brace, the syntax, a
colon, the semantic, and a closing curly brace, i.e. %{SYNTAX:SEMANTIC}.

So what do we mean when we say syntax and semantic?

Well, syntax, very simply, means specifying one of those patterns I just showed you,
like USERNAME or BASE10NUM, and semantic is the name of the field you want to put
the matched value into.

I don't really love the way that's named, because it's kind of confusing; it could
have just been pattern and field, but it's syntax and semantic. Either way, let me
show you an example. In this case, let's say the value we want to match is
‘csanders’.

Well, that's text, and it will match the WORD pattern that's provided with Grok. So
we would specify WORD as the pattern we're matching and put the result in the
username field. That would match that particular value.

Let's try another one: 192.168.1.14. Well, Logstash and Grok actually provide an
IPV4 pattern. So we can specify the IPV4 pattern; that will apply the appropriate
regular expression, and we'll put the result in the SRCIP field.

Let's look at a couple more. 29281: well, that's a number, so we can match it with
the NUMBER pattern, and let's say in this case it represents bytes. I know you might
have thought that was a zip code, but I tricked you; it's actually a byte count.

So we're putting that into the field bytes. What about this one
‘/index/source/file.php’? Well, that's a URI. So we're fortunate because Grok also
has a URI pattern and in this case we put it into the url field.
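Putting those examples side by side, the individual matches look something like this. The exact pattern names, especially URIPATH for the path example, are my best guess at what's shown on screen:

    %{WORD:username}     matches   csanders
    %{IPV4:srcip}        matches   192.168.1.14
    %{NUMBER:bytes}      matches   29281
    %{URIPATH:url}       matches   /index/source/file.php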

So Grok has a lot of useful patterns like this for us. It has them for paths on
Windows and Unix, protocol numbers, hosts, URIs, date values (years, hours, minutes,
and so on), specific date formats; it has a lot of these different patterns.

You're just going to have to keep that reference I showed you handy. I'll provide
the URL to it, of course; you saw it in the video just a second ago. Keep it handy,
because you're going to be looking at it pretty constantly as you go through and
build your Grok patterns.

And pattern again is the key word here. In these examples I’m showing you how
to match individual fields, but we're actually matching a series of fields and that's
what makes up the full matching pattern that we're using. So when I say building
patterns, let's take a look at this example.

Here we have an IP address, an HTTP method, a URL, and a couple of numbers, maybe a
byte count and a duration in this case. What would the pattern look like? In this
case it would look like this. Notice that in our example log the fields are
separated by spaces, so in our pattern the different identifiers, the different
patterns we're pulling from Grok, are separated by spaces as well.

So we're saying that the first value Grok gets to is an IP, and we're putting it in
the client field. Then there's a space, then the next value is a word and we're
putting it in the method field; another space, then a URI path parameter going into
the request field; another space, a number going into the bytes field; and finally
another space and a number going into the duration field.
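Written out, the full pattern for that example line would look like this:

    %{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}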

So you can kind of see what we're doing here: we're essentially building regular
expressions, but we're using shortcuts to do it. That's what Grok allows us to do,
and that's one of the things that makes it quite a bit easier.

So hopefully that made a little bit of sense. I think it'll make a lot more sense
once we actually go through our example. So let's take this knowledge and use it to
build out our Grok pattern for the data I showed you just a minute ago, the Squid
proxy log data.

To build our pattern I'm actually going to use a third-party tool that I really
like, and it's this website here, ‘grokdebug.herokuapp.com’. This is a very simple
app that allows us to build our pattern in real time and test it.

If we were to build this straight into our config file, we would have to constantly
be restarting Logstash and deleting our indexes, and it would just take forever. So
this is a much better way to do it. Here I've pasted an example log line from our
log file up at the top, and down here where it says Pattern, I can start typing out
the pattern as I build it; it will parse it in real time and show me the results
below.

Now, generally you would paste quite a few log lines up here. For our purposes
they're all pretty much the same (I generated the log file myself), so I'm just
going to use one for the sake of simplicity.

So let's go ahead and start typing in our parsing pattern, and we'll start with the
first field, of course, which in this case is a number; it's actually a Unix
formatted timestamp.

So we want to parse that as a number, since it is just a number at this point, and
we'll use the date plugin later to convert it into the appropriate field.

So we're going to specify the % sign, open up our curly braces, and remember it's
going to be pattern, colon, and then field name. I'm going to parse this as a number
and put it into the ‘datetime’ field. Notice when I do that, it looks like it
parses; we actually get results. If it didn't parse, we wouldn't get any results.

So down here it's parsed. It actually goes into two fields, ‘datetime’ and
‘base10number’, which is just a product of how the NUMBER pattern is built, but we
can see that in this case it parsed correctly.
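So at this point the pattern is just that one expression:

    %{NUMBER:datetime}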

Following this, we have another number, and this is the elapsed time (how long the
request took). So we're going to parse that into a field called elapsed, and we will
parse it with the integer pattern. Now, I want to show you something here. Notice
when I put this in, we get no matches, and that's because what we've just added
doesn't match what comes next in this log line.

And why is that? Well, I only put a space here, and it looks like there's more than
one space, but actually when I highlight this you can see it's a tab.

So to specify a tab here, just like you would with a regular expression, we use
‘\t’, and notice that now we parse. Now we have an ‘elapsed’ field with the value
757. So it looks like we have tabs separating these fields, and we're specifying
those via ‘\t’.
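With the tab accounted for, the pattern so far looks like this (using INT as the integer pattern, which is my guess at what's typed on screen):

    %{NUMBER:datetime}\t%{INT:elapsed}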

Let's try another one of these and it looks like the thing we have next is an IP
address, and of course as we saw in the example earlier we have a type for this.
We have a pattern that we can use for this.

And we will parse it as such. Let me make sure I've got this configured just right.
Yeah, it looks like I forgot the % sign; that's important, so I put that in there,
and now it looks like we're parsing that out as the source IP as well. After the
source IP we have another tab, and then we have something interesting.

We have this field which shows us the proxy status and the HTTP response code.
Those are kind of here together and we want both of those, but we actually want
them in separate fields, so that's what we're going to do here.
I'm going to put another % sign, open up our curly braces, and we're going to say
that we want to parse out a word, in this case into the proxy result field; down
here we see it picks up TCP_MISS. It looks like after that we have a forward slash,
so let's capture that forward slash literally. Then we want to parse out an integer,
which is our HTTP response code, and sure enough we get that now. I'm doing this on
the fly pretty quickly, and that's just because I know what these types are, and
obviously I practiced this before I recorded it.
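That proxy status / response code piece of the pattern looks roughly like this (field names are the ones used later in Kibana):

    %{IPV4:SRCIP}\t%{WORD:proxy_result}/%{INT:http_response}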

But again, what you'll be doing here, working iteratively, is coming to these fields
one at a time. So you'll come to 507. Well, that's a number; what pattern can I
match a number with? Then you'll go back to the Grok patterns reference, and we have
a couple of options: you can use NUMBER, you can use BASE10NUM, you can use INT; it
doesn't always matter specifically which.

In terms of saving memory and getting things into the most appropriate field types,
it does matter to some degree, but for our purposes I'm just going to get these into
something that works. So we see we have the integer pattern, which we can use. Let's
go ahead and specify that we have another tab, then an integer, and we'll call that
field bytes_xferred, since it is the number of bytes transferred.

Looks like that parsed successfully. So we're making progress; we've only got a few
more fields left, so let's go ahead and get those knocked out. It looks like we've
got another tab, and then we want to parse out the HTTP request method.
That is a word, and we'll call it http_method. I'm going to scroll down since this
is getting a bit long. It looks like we got that; we got POST. After that we have
another tab, and then we have the URI. Let's go ahead and parse that out into the
‘url’ field, since that's the common name we're using elsewhere. And then we have
this field, this DIRECT thing here.

That's an artifact of the way that Squid logs this data. We don't actually want it;
we're going to drop this field later. But we'll go ahead and parse it out, so that
we have a named field we can reference when we do that.

Part of this we actually do want: it's combined here with our destination IP
address, which we certainly want. So let's go ahead and get this knocked out. To do
that, let's parse the first part as a word; this is the ‘squid_handle’, so we're
just going to call it squid_handle. After that we have a forward slash, then another
IPv4 address, which we're going to parse into DSTIP. And then we're going to use a
little bit of a trick: we'll do another ‘\t’ for our final tab and parse the last
field, which is our content type field.

It looks like this one is an XML content type, and we're going to use the GREEDYDATA
pattern and map it into content_type. If we go down here and look for GREEDYDATA,
you'll see it's a very simple regular expression.

It basically just grabs everything that's left, and that's a cheap way to capture
the last field. I'll show it to you because it's effective and it works.

So we scroll down here, and it looks like everything was parsed appropriately. We
see our URL is parsed as a URL, but it also gives us some additional fields showing
what it would look like broken down further: we see proto, user, username, urihost,
hostname, and so on.

So that's parsed out a couple of different ways, and those extra fields will be
there alongside the others. Let's see: we can go down here and see that squid_handle
is DIRECT (we're going to drop that field later, but at least now we know how to
reference it), we have DSTIP, which is parsed out here as 174.129 and so on, and
finally we have content_type, which is application/xml. So everything looks good,
everything is parsed out, and we've accounted for everything in this sample log.
Ideally, if we paste another sample log line in here, we'll get the same results.

I’m not going to do that now. You'll just have to trust me that it works. So what
we have is a completed Grok pattern that will parse out this particular log file.
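Pieced together from the walkthrough above, the completed pattern looks roughly like this. This is a reconstruction; the exact pattern and field names may differ slightly from what was typed on screen:

    %{NUMBER:datetime}\t%{INT:elapsed}\t%{IPV4:SRCIP}\t%{WORD:proxy_result}/%{INT:http_response}\t%{INT:bytes_xferred}\t%{WORD:http_method}\t%{URI:url}\t%{WORD:squid_handle}/%{IPV4:DSTIP}\t%{GREEDYDATA:content_type}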

Now, one thing I want to say here: don't get discouraged if this takes you a while.
I did this really quickly here, but I build these things quite often, and even then
it isn't that quick; it's a bit of a process.

So even though I did this on screen in this recording here in just a few minutes, it
actually took me closer to half an hour to build this. So it takes a little time, it
takes a little playing around with. And you have to get things just right to make
sure it works perfectly.

So it's going to take you some time. Especially starting out, it may take you an
hour or longer to get something like this together. That's okay! The more you do it,
the better you're going to get, and the more you learn about the pre-built patterns
we've talked about, the quicker you're going to be able to get these put together.

So we're going to take this pattern and add it to our Logstash configuration as a
Grok filter. Let's go ahead and do that now. To make this happen I'm going to edit
the squid.config file that we've been working with.

We're going to go down to the filter section, and I'm going to specify the Grok
filter plug-in, which is the plug-in we'll be using here. For Grok we're really only
going to use one primary option, and that's the match option, because that's what
we're doing: we're performing a match against the data we're looking at.

That's going to sit in its own array and within that we're going to tell it to operate
on the message field. So you have to specify the field you're using this with, and
for us it's the message field which is the only field we have since we're putting the
whole log into that as of now.

And then we put the pattern in quotes; I'm just going to paste it in here. It's
quite long, as you saw when we built it. So we paste that in, I'll save this, and we
should be good to go. Actually, I'm going to do one more thing in here: I'm going to
drop the message field once we're done.

You might remember I added the remove_field option earlier, I believe under the
mutate plugin. It's actually an option that exists under multiple plugins (it's a
common parameter), so we're going to go ahead and just put it here for simplicity's
sake, and we're going to drop the message field. We'll remove it so that we're left
with nothing but good, wholesome parsed data in all of its fields. That's what we
hope we have.
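So the filter section of the config ends up looking something like this. This is a sketch: the pattern string is the one we built above, abbreviated here for readability, and the match option can also be written in its array form as described a moment ago:

    filter {
      grok {
        # match against the raw line, which currently lives in the message field;
        # the full pattern from the debugger goes here, shown abbreviated
        match => { "message" => "%{NUMBER:datetime}\t%{INT:elapsed}\t ... \t%{GREEDYDATA:content_type}" }
        # drop the raw line once it has been parsed into individual fields
        remove_field => [ "message" ]
      }
    }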

So let's go ahead and restart Logstash and see what we get. Okay, well, I'll tell
you what, it looks like we were successful. I see we don't have a message field
anymore, but we do have all the fields we expect to see. They're not in the order we
parsed them out in, they're in a different order here, but we see http_response,
SRCIP, bytes_xferred, proxy_result, URL, elapsed, the datetime field, http_method,
and content_type. We have DSTIP and squid_handle over here as well.

So it looks like we have everything. Looks like everything parsed out successfully. I
don't see any errors. If there were errors you would see a tags field added here,
and that tags field would have something that said that there was a Grok parsing
failure.

You can actually search on those, and that's a common thing to do, because
especially if you have large, diverse logs you might match 99% of your lines but 1%
of them might not parse correctly, and you want to know why that is.

So you can search that tags field for the Grok errors, then look at those logs and
see why they might not have worked, paste them into the Grok debugger, and tweak and
play around a little bit. That's pretty common. Don't get discouraged; it's
amazingly common to have some unparsed logs.
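For reference, the tag Logstash adds when a match fails is _grokparsefailure, so in Kibana that search is typically just:

    tags:_grokparsefailure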

Our example here is pretty simple, and it looks like everything parsed; I don't see
any of those tags here, just doing a quick scroll, and of course I can search for
them as well. But it looks like our data is parsed, and that's really the tough part
of this battle: turning our unstructured data into structured data, and we have that
now.

So now we can treat it the same way we treated things in the last example; we can
apply any number of filters to it. So let's go ahead and apply a couple of those.
We'll delete our existing index and flip back over to our console; I've gone ahead
and closed the log example at the bottom since we don't need that anymore.

And we're going to add a couple of additional filter statements. Let's start with
one you've seen before, which is the date plug-in. We want to use that because we
don't want two date fields; we don't want the index time, we actually just want to
use the timestamp that comes along with the data we're parsing.

So I'm going to specify the datetime field that we parsed out of the unstructured
data, and note that it's in Unix format.

So we'll specify that and we should be good to go there. That will take the contents
of the datetime field and put them into the @timestamp field in ES.

Now we have a few fields we want to drop. One of those is the datetime field,
because we don't need it anymore; we don't need two instances of the time. We also
want to drop our message field, and we want to drop the field we explicitly parsed
out just so we could drop it, which is the squid_handle field. We don't need that
for our purposes here.

So to do that I'm going to use the mutate plugin, and we're going to use the
remove_field option.
Now, remove_field is an array, so we can specify multiple fields in it. I'm going to
go ahead and specify message, squid_handle, and datetime. Those are all fields we
want to drop; we don't want to index them, so that should get us taken care of
there.
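Those two additions look roughly like this, a sketch using the field names from our pattern:

    date {
      # datetime holds a Unix epoch timestamp; this moves it into @timestamp
      match => [ "datetime", "UNIX" ]
    }
    mutate {
      # drop the raw line, the redundant timestamp, and the throwaway squid_handle field
      remove_field => [ "message", "squid_handle", "datetime" ]
    }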

Now I want to do a couple of things you haven't seen before, all within the mutate
filter plug-in.

First, I want to lowercase the url field, and to do that I'm going to use the
lowercase option; I just need to specify the url field name there, in the array.

What I'm doing here is just lowercasing that field. That way we don't have problems
later on when we're searching: if we happen to be searching in a case-sensitive
manner, we don't want to miss a URL just because one of our data sources happened to
have it in upper case. I see that mess people up a lot.

That's not to say you always want to lowercase these fields, but you at least want
to be aware of what case the data you're searching exists in, because when you're
searching that's certainly an important thing. Using the lowercase option in the
mutate plug-in allows us to handle that.
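That option is just a list of fields to lowercase:

    mutate {
      # normalize the url field so case-sensitive searches don't miss anything
      lowercase => [ "url" ]
    }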

I want to show just one more mutate plug-in option that you haven't seen yet, and
that is convert. Convert does, well, exactly what it says it does: it converts. In
this case, it converts data types.

So I mentioned earlier that unless otherwise specified, when things are indexed
into ES, they'll get indexed as strings when doing it through Logstash.

And that's great for many things, but if you look at our data, we have a lot of
numbers here. And we might want to perform numeric operations on those such
as statistics, addition, subtraction, and so on.

In order to do that, those things have to be configured with number based data
types. So we actually want to do some manual converting there to make sure
we're getting that done. To do that we're going to have to create an object here,
and we just specify the fields we want to convert.

So we'll start with the elapsed field; it has a decimal point, so we'll specify that
it's a float. bytes_xferred we'll call an integer, and the http_response code, we
probably won't end up doing any mathematical operations on that, but it is
nonetheless an integer, so we'll cast it to the right type.
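So the convert option ends up looking something like this:

    mutate {
      convert => {
        "elapsed"       => "float"     # has a decimal point
        "bytes_xferred" => "integer"
        "http_response" => "integer"
      }
    }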

All right, so we should be set there. We've used the date plug-in and the mutate
plug-in to do some additional data massaging here to make the data a little bit
more ready for the analysis we want to perform.

So before we go check that, one thing I do want to do is tail the Logstash log and
look for any errors. It doesn't look like there are any; it says the pipeline
started. If there were errors, we would see them here.

I do this because I did something in this demo that I would never encourage you to
do: I made a whole bunch of changes to the config file at once.

Especially while you're learning, make one change at a time and then check this log
file, and check that you're getting what you expect to get, because otherwise, if
you make a whole bunch of changes like I just did and something messes up, you're
not going to know which one it is.

The log file will provide some insight into that, but it can be a pain to have to go
back, trace it, and remove things one at a time.

So perform your changes iteratively, not like I did here; for the sake of time I've
done them all at once.

So all I really need to do now is restart Logstash and we should be good to go.
Let's do that, and then make sure we were able to get the data in the format we were
expecting.

If we look at the data here, let's see if it matches what we expect based upon the
changes we made.
First of all, I don't see any of the fields that we told it to drop, so that's a
good thing. I don't see message, squid_handle, or datetime; those are all gone.

Speaking of datetime, since it's gone, let's see if it was parsed correctly. It
looks like it was: it was parsed over into the @timestamp field, so this is the
datetime that was in the logs themselves, not the time they were indexed.

In terms of casing, it looks like the ‘url’ here is lower cased. I don't think too many
of them were uppercased in the first place. So we might not be able to
immediately tell that that occurred, but it should have worked the way we input it
there.

Beyond that, the only other change we made was the conversions. It's hard to tell
from here whether things converted into the proper types, so let me go into
Management and Index Patterns and see what we've got.

Yeah, so it looks like elapsed is now a number, as is http_response, and so on. So
those are appropriately configured as numbers, and the data is now as we expected.

We can search it, and we can perform really nice aggregations on these things; for
instance, here's the breakdown of HTTP methods. It looks like we have mostly POSTs
and a few GETs, and the responses are mostly 200s; we have some 204s here, which is
interesting.

So we have data that we can manipulate, we can do what we want with, and this
was a little bit harder. We had to go from unstructured data and make it
structured, but we got the results we wanted and now we can continue to filter
that if we want, or start actually using it to investigate whatever it is we need to
investigate.

That'll do it for our discussion of filtering. The unique thing about this
particular section was that we dealt with unstructured data, which is very different
from the last one, where we looked at structured data.

Here we looked at Grok and how it can be used to provide structure to unstructured
data, hopefully in a way that isn't entirely too painful. It's a lot easier than
writing regular expressions, and a little more flexible in regards to how you manage
your time; at the very least it saves time and is much better than raw regular
expressions for the most part.

You also learned a little bit more about some other plugins, a couple of new ones as
well as some we've already dealt with a little bit. Grok was the main one, but you
also got to use date again, and we used the mutate plugin in some new ways.

So once again, I'm going to ask you to take the exercises here and go through them
on your own. Try to recreate what we've done here, but do it yourself. Don't just
use the config file; I've included that for reference, but try to build it up on
your own, do your own parsing, and figure this out yourself. The more you do this
yourself, the better you're going to get at it.

And especially with Grok get some practice building those patterns. And of course
we'll provide some great lab exercises for you to practice that with as well.
