Rapid Cybersecurity Ops
Paul Troncone and Carl Albing
Printed in the United States of America.
O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or [email protected].
Editor: Virginia Wilson
Production Editor: Justin Billing
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest
May 2019: First Edition
2018-10-09: First Release
2018-11-27: Second Release
See http://oreilly.com/catalog/errata.csp?isbn=9781492041313 for release details.
The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Rapid Cybersecurity Ops, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.
While the publisher and the authors have used good faith efforts to ensure that the information
and instructions contained in this work are accurate, the publisher and the authors disclaim all
responsibility for errors or omissions, including without limitation responsibility for damages
resulting from the use of or reliance on this work. Use of the information and instructions
contained in this work is at your own risk. If any code samples or other technology this work
contains or describes is subject to open source licenses or the intellectual property rights of
others, it is your responsibility to ensure that your use thereof complies with such licenses
and/or rights.
9781492041313
[LSI]
A NOTE FOR EARLY RELEASE READERS
This will be the fourth chapter of the final book. The preceding chapters will cover foundational knowledge including the command-line interface and use of bash.
Regular expressions (regex) are a powerful method for describing a text pattern to be matched by various tools. There is only one place in bash where regular expressions are valid: the =~ comparison in the [[ compound command, as in an if statement. However, regular expressions are a crucial part of the larger toolkit for commands like grep, awk, and sed in particular. They are very powerful and thus worth knowing. Once mastered, you'll wonder how you ever got along without them.
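For example, here is a minimal sketch of that =~ usage (the variable WORD and its value are our own, not part of the book's examples); the text matched by the pattern is available afterward in the BASH_REMATCH array:
$ WORD='Two roads diverged'      # sample string of our own
$ if [[ $WORD =~ roads ]]; then echo "matched: ${BASH_REMATCH[0]}"; fi
matched: roads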
For many of the examples in this chapter we will be using the file frost.txt with its seven, yes
seven, lines of text.
Example 1-1. frost.txt
1 Two roads diverged in a yellow wood,
2 And sorry I could not travel both
3 And be one traveler, long I stood
4 And looked down one as far as I could
5 To where it bent in the undergrowth;
6
7 Excerpt from The Road Not Taken by Robert Frost
The content of frost.txt will be used to demonstrate the power of regular expressions to process
text data. This text was chosen because it requires no prior knowledge to understand.
Commands in Use
We introduce the grep family of commands to demonstrate the basic regex patterns.
grep
The grep command searches the content of the files for a given pattern and prints any line
where the pattern is matched. To use grep, you need to provide it with a pattern and one or
more filenames (or piped data).
COMMON OPTIONS
-c
Count the number of lines that match the pattern.
-E
Enable extended regular expressions.
-f
Read the search pattern from a provided file. A file can contain more than one pattern, with each line containing a single pattern.
-i
Ignore character case.
-l
Only print the filename and path where the pattern was found.
-n
Print the line number of the file where the pattern was found.
-P
Enable the Perl regular expression engine.
-R, -r
Recursively search subdirectories.
COMMAND EXAMPLE
In general, grep is used like this: grep options pattern filenames
To search the /home directory and all subdirectories for files containing the word password, irrespective of uppercase/lowercase distinctions:
grep -R -i 'password' /home
The grep command supports some variations, notably an extended syntax for the regex patterns (we'll discuss the regex patterns next). There are three different ways to tell grep that you want special meaning on certain characters: 1) by preceding those characters with a backslash; 2) by telling grep that you want the special syntax (without the need for a backslash) by using the -E option when you invoke grep; or 3) by using the command named egrep, which is a script that simply invokes grep as grep -E so you don't have to.
The only characters affected by the extended syntax are ? + { | ( and ). In the examples that follow we will use grep and egrep interchangeably; they are the same binary underneath. We will choose the one that seems most appropriate based on what special characters we need. The special, or meta, characters are what make grep so powerful. Here is what you need to know about the most powerful and frequently used metacharacters.
In regex, the “.” represents a single wildcard character. It will match on any single character
except for a newline. As can be seen in the example below, if we try to match on the pattern
T.o the first line of the frost.txt file is returned because it contains the word Two.
$ grep 'T.o' frost.txt
1 Two roads diverged in a yellow wood,
Note that line 5 is not returned even though it contains the word To. This pattern allows any
character to appear between the T and o, but as written there must be a character in between.
Regex patterns are also case sensitive, which is why line 3 of the file was not returned even
though it contains the string too. If you want to treat "." as a period character rather than a
wildcard, precede it with a backslash "\." to escape its special meaning.
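As a quick illustration of the difference (a throwaway one-liner of ours, using grep's -o option to print only the matched text rather than the whole line):
$ echo '3x14 3.14' | grep -o '3.14'     # unescaped: . matches any character
3x14
3.14
$ echo '3x14 3.14' | grep -o '3\.14'    # escaped: only a literal period matches
3.14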
In regex, the “?” character makes any item that precedes it optional; it matches it zero or one
time. By adding this metacharacter to the previous example we can see that the output is
different.
$ egrep 'T.?o' frost.txt
1 Two roads diverged in a yellow wood,
5 To where it bent in the undergrowth;
This time we see that both lines 1 and 5 are returned. This is because the metacharacter "." is
optional due to the "?" metacharacter that follows it. This pattern will match on any three-character sequence that begins with T and ends with o, as well as the two-character sequence To.
Notice that we are using egrep here. We could have used grep -E, or we could have used "plain" grep with a slightly different pattern, T.\?o, putting the backslash on the question mark to give it the extended meaning.
In regex, the "*" is a special character that matches the preceding item zero or more times. It is
similar to the "?“, the main difference being that the previous item may appear more than once.
$ grep 'T.*o' frost.txt
1 Two roads diverged in a yellow wood,
5 To where it bent in the undergrowth;
7 Excerpt from The Road Not Taken by Robert Frost
The ".*" in the pattern above allows any number of any character to appear in between the T
and o. Thus the last line also matches because it contains the pattern The Ro.
In regex, the "+" is a special character that matches the preceding item one or more times. It is similar to "*", except that the preceding item must appear at least once.
$ egrep 'T.+o' frost.txt
1 Two roads diverged in a yellow wood,
5 To where it bent in the undergrowth;
7 Excerpt from The Road Not Taken by Robert Frost
The pattern above specifies one or more of any character to appear in between the T and o. The first line of text matches because of Two; the w is one character between the T and the o. The second line doesn't match just the To, as in the previous example; rather, the pattern matches a much larger string, all the way to the o in undergrowth. The last line also matches because it contains the pattern The Ro.
Grouping
We can use parentheses to group together characters. Among other things, this allows us to treat the characters appearing inside the parentheses as a single item that we can later reference.
$ egrep 'And be one (stranger|traveler), long I stood' frost.txt
3 And be one traveler, long I stood
In the example above we use parentheses and the Boolean OR operator "|" to create a pattern that will match on line 3. Line 3 as written has the word traveler in it, but this pattern would also match if traveler were replaced by the word stranger.
In regex the square brackets, [ ], are used to define character classes and lists of acceptable
characters. Using this construct you can list exactly which characters are matched at this
position in the pattern. This is particularly useful when trying to perform user input validation.
As a shorthand you can specify ranges with a dash, such as [a-j]. These ranges are in your locale's collating sequence and alphabet. For the C locale, the pattern [a-j] will match one of the letters a through j. Table 1-1 provides a list of common examples when using character classes and ranges.
Table 1-1. Regex character ranges
Example Meaning
[abc] Match only the character a or b or c
[1-5] Match on digits in the range 1 to 5
[a-zA-Z] Match any lowercase or uppercase letter a to z
[0-9+*/] Match on numbers or these 4 mathematical symbols
[0-9a-fA-F] Match a hexadecimal digit
WARNING
Be careful when defining a range for digits; the range can at most go from 0 to 9.
For example, the pattern [1-475] does not match on numbers between 1 and 475;
it matches on any one of the digits (characters) in the range 1-4, or the character 7, or
the character 5.
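Character classes pair naturally with the =~ test shown earlier for the kind of input validation mentioned above. A minimal sketch, using a deliberately loose pattern of our own (it only checks that the input consists of digits and dots, not that it is a well-formed address; the ^ and $ anchors are covered later in this chapter):
$ IP='10.0.0.25'     # sample input of our own
$ if [[ $IP =~ ^[0-9.]+$ ]]; then echo 'looks like an IPv4 address'; fi
looks like an IPv4 address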
There are also predefined character classes known as shortcuts. These can be used to indicate
common character classes such as numbers or letters. See Table 1-2 for a list of shortcuts.
Table 1-2. Regex shortcuts
Shortcut Meaning
\s Whitespace
\S Not Whitespace
\d Digit
\D Not Digit
\w Word
\W Not Word
\x Hexadecimal Number (e.g. 0x5F)
Note that the above shortcuts are not supported by egrep. In order to use them you must use grep with the -P option. That option enables the Perl regular expression engine to support the shortcuts. For example, to find any numbers in frost.txt:
$ grep -P '\d' frost.txt
1 Two roads diverged in a yellow wood,
2 And sorry I could not travel both
3 And be one traveler, long I stood
4 And looked down one as far as I could
5 To where it bent in the undergrowth;
6
7 Excerpt from The Road Not Taken by Robert Frost
There are other character classes (with a more verbose syntax) that are valid only within the
bracket syntax, as seen in Table 1-3. They match a single character, so if you need to match
many in a row, use the star or plus to get the repetition you need.
Table 1-3. Regex character classes in brackets
Character Class Meaning
[:alnum:] any alphanumeric character
[:alpha:] any alphabetic character
[:cntrl:] any control character
[:digit:] any digit
[:graph:] any graphical character
[:lower:] any lowercase character
[:print:] any printable character
[:punct:] any punctuation
[:space:] any whitespace
[:upper:] any uppercase character
[:xdigit:] any hex digit
To use one of these classes it has to be inside the brackets, so you end up with two sets of brackets. For example, grep '[[:cntrl:]]' large.data will look for lines containing control characters (ASCII 0 through 31, and 127). Here is another example:
grep 'X[[:upper:][:digit:]]' idlist.txt
will match any line with an X followed by any uppercase letter or digit. It would match these
lines:
User: XTjohnson
an XWing model 7
an X7wing model
They each have an uppercase X followed immediately by either another uppercase letter or by a
digit.
Back References
Regex back references are one of the most powerful and often confusing regex operations.
Consider the following file, tags.txt:
1 Command
2 <i>line</i>
3 is
4 <div>great</div>
5 <u>!</u>
Suppose you want to write a regular expression that will extract any line that contains a
matching pair of complete HTML tags. The start tag has an HTML tag name; the ending tag has
the same tag name but with a leading slash. <div> and </div> are a matching pair. You could
search for these by writing a lengthy regex that contains all possible HTML tag values, or you
can focus on the format of an HTML tag and use a regex back reference.
$ egrep '<([A-Za-z]*)>.*</\1>' tags.txt
2 <i>line</i>
4 <div>great</div>
5 <u>!</u>
In this example, the back reference is the \1 appearing in the latter part of the regular expression. It refers back to the expression enclosed in the first set of parentheses, [A-Za-z]*, which has two parts. The letter range in brackets denotes a choice of any letter, uppercase or lowercase. The asterisk (or star) that follows it means to repeat that zero or more times. Therefore the \1 refers to whatever was matched by that pattern in parentheses. If [A-Za-z]* matches div, then the \1 also refers to the pattern div.
The overall regular expression, then, can be described as matching a < sign (that literal character is the first one in the regex), followed by zero or more letters, then a > sign, and then zero or more of any character ("." for any character, "*" for zero or more of the previous item), followed by another < and a slash, then the sequence matched by the expression within the parentheses, and finally a > character. If this sequence matches any part of a line from our text file, then egrep will print that line out.
You can have more than one back reference in an expression and refer to each with a \1 or \2 or \3, depending on its order in the regular expression. A \1 refers to the first set of parentheses, \2 to the second, and so on. Note that the parentheses are metacharacters; they have a special meaning. If you just want to match a literal parenthesis you need to escape its special meaning by preceding it with a backslash, as in sin\([0-9.]*\) to match expressions like sin(6.2) or sin(3.14159).
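For instance (a throwaway example of ours, again using the -o option to show only what matched), the escaped parentheses match the literal characters, whereas unescaped parentheses merely group:
$ echo 'sin(3.14159)' | egrep -o 'sin\([0-9.]*\)'   # literal parentheses
sin(3.14159)
$ echo 'sin(3.14159)' | egrep -o 'sin([0-9.]*)'     # grouping parentheses; nothing after sin matches
sin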
NOTE
Valid HTML doesn’t have to be all on one line; the end tag can be several lines
away from the start tag. Moreover, some tags can both start and end in a single tag,
such as <br/> for a break, or <p/> for an empty paragraph. We would need a
more sophisticated approach to include such things in our search.
Quantifiers
Quantifiers specify the number of times an item must appear in a string. Quantifiers are defined
by the curly brackets { }. For example, the pattern T{5} means that the letter T must appear
consecutively exactly 5 times. The pattern T{3,6} means that the letter T must appear
consecutively 3 to 6 times. The pattern T{5,} means that the letter T must appear 5 or more
times.
You can use anchors to specify that a pattern must exist at the beginning or the end of a string. The ^ character anchors a pattern to the beginning of a string. For example, ^[1-5] means that a matching string must start with one of the digits 1 through 5 as the first character on the line. The $ character anchors a pattern to the end of a string or line. For example, [1-5]$ means that a string must end with one of the digits 1 through 5.
In addition, you can use \b to identify a word boundary (such as a space, punctuation, or the start or end of a line). The pattern \b[1-5]\b will match any of the digits 1 through 5 where the digit appears as its own word.
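Here are a couple of these constructs applied to frost.txt (our own invocations; note that \b is a GNU grep extension rather than part of POSIX extended regular expressions):
$ egrep 'o{2}' frost.txt        # lines containing two consecutive o characters
1 Two roads diverged in a yellow wood,
3 And be one traveler, long I stood
4 And looked down one as far as I could
$ egrep '\bone\b' frost.txt     # lines containing one as a whole word
3 And be one traveler, long I stood
4 And looked down one as far as I could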
Summary
Regular expressions are extremely powerful for describing patterns and can be used in
coordination with other tools to search and process data.
The uses and full syntax of regex far exceed the scope of this book. You can visit the resources below for additional information and utilities related to regex.
http://www.rexegg.com/
https://regex101.com
https://www.regextester.com/
http://www.regular-expressions.info/
In the next chapter we will discuss common data types relevant to security operations and how they can be gathered.
Practice
After completing the exercises below you will be able to:
1. Use a regular expression to search a file
2. Use a regex back reference
3. Use regex ranges
4. Use regex groups
Exercises
1. Write a regular expression that matches a floating point number (a number with a decimal
point) such as 3.14. There can be digits on either side of the decimal point but there need
not be any on one side or the other. Allow it to match just a decimal point by itself, too.
2. Use a back reference in a regular expression to match a number that appears on both sides
of an equal sign. For example, it should match “314 is = to 314” but not “6 = 7”
3. Write a regular expression that looks for a line that begins with a digit and ends with a
digit, with anything occurring in between.
4. Write a regular expression that uses grouping to match on the following 2 IP addresses:
10.0.0.25 and 10.0.0.134.
5. Write a regular expression that will match if the hexadecimal string 0x90 occurs more than
3 times in a row (i.e. 0x90 0x90 0x90).
Chapter 2. Data Collection
A NOTE FOR EARLY RELEASE READERS
This will be the fifth chapter of the final book.
Data is the lifeblood of nearly every defensive security operation. Data tells you the current state of the system, what has happened in the past, and even what might happen in the future. Data is needed for forensic investigations, verifying compliance, and detecting malicious activity.
Table 2-1 describes data that is commonly relevant to defensive operations and where it is typically located.
Table 2-1. Data of Interest
Data | Description | Location
Log Files | Details on historical system activity and state. Interesting log files include web and DNS server logs; router, firewall, and intrusion detection system logs; and application logs. | In Linux, most log files are located in the /var/log directory. On a Windows system, logs are found in the Event Log.
Command History | List of recently executed commands. | In Linux, the location of the history file can be found by executing echo $HISTFILE; it is typically .bash_history in the user's home directory.
Temporary Files | Various user and system files that were recently accessed, saved, or processed. | In Windows, temp files can be found in c:\windows\temp and %USERPROFILE%\AppData\Local\. In Linux, temp files are typically located in /tmp and /var/tmp; the Linux temporary directory can also be found by using the command echo $TMPDIR.
User Data | Documents, pictures, and other user-created files. | User files are typically located in /home/ in Linux and c:\Users\ in Windows.
Windows Registry | Hierarchical database that stores settings and other data critical to the operation of Windows and applications. | Windows Registry
Throughout this chapter we will explore various methods to gather data, locally and remotely,
from both Linux and Windows systems.
Commands in Use
We introduce cut, file, and head, along with the Windows commands reg and wevtutil, to gather and select data of interest from local and remote systems.
cut
cut is a command used to extract select portions of a file. It reads a supplied input file line by line and parses the line based on a specified delimiter. If no delimiter is specified, cut will use the tab character by default. The delimiter characters divide each line of a file into fields. You can use either the field number or character position number to extract parts of the file. Fields and characters start at position 1.
-c
Specify the character(s) to extract.
-d
Specify the character used as a field delimiter. By default the delimiter is the tab character.
-f
Specify the field(s) to extract.
COMMAND EXAMPLE
Example 2-1. cutfile.txt
12/05/2017 192.168.10.14 test.html
12/30/2017 192.168.10.185 login.html
In cutfile.txt each field is delimited using a space. To extract the IP address (field position 2) you can use the following command:
$ cut -d' ' -f2 cutfile.txt
192.168.10.14
192.168.10.185
The -d' ' option specifies the space as the field delimiter. The -f2 option tells cut to return the second field, in this case the IP address.
WARNING
The cut command considers each delimiter character as separating a field. It
doesn’t collapse white space. Consider the following example:
Pat 25
Pete 12
If we use cut on this file we would define the delimiter to be a space. In the first
record there are 3 spaces between the name (Pat) and the number (25). Thus the
number is in field #4. However for the next line, the name (Pete) is in field #3, since
there are only two space characters between the name and the number. For a data
file like this, it would be better to separate the name from the numbers with a single
tab character and use that as the delimiter for cut.
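For a data file like that, the fix is exactly what the warning suggests. A quick sketch of our own, using printf to build two tab-separated records and letting cut fall back to its default tab delimiter:
$ printf 'Pat\t25\nPete\t12\n' | cut -f2
25
12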
file
The file command is used to help identify a given file's type. This is particularly useful in Linux, as most files are not required to have an extension that can be used to identify their type (cf. .exe in Windows). The file command looks deeper than the filename by reading and analyzing the first block of data, also known as the file's magic number. Even if you rename a .png image file to end with .jpg, the file command is smart enough to figure that out and tell you the correct file type (in this case, a PNG image file).
-f
Read the list of files to analyze from a given file.
-k
Do not stop on the first match; list all matches for the file type.
-z
Look inside compressed files.
COMMAND EXAMPLE
To identify the file type just pass the filename to the file command.
$ file unknownfile
unknownfile: Microsoft Word 2007+
head
The head command displays the first few lines or bytes of a file. By default head displays the
first 10 lines.
-n
Specify the number of lines to output. To show 15 lines you can specify it as -n 15 or -15.
-c
Specify the number of bytes to output.
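For example (our own invocation, reusing the frost.txt file from the previous chapter):
$ head -n 2 frost.txt
1 Two roads diverged in a yellow wood,
2 And sorry I could not travel both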
reg
The reg command is used to manipulate the Windows Registry and is available in Windows
XP and later.
add
Adds an entry to the registry.
export
Copies the specified registry entries to a file.
query
Returns a list of subkeys below the specified path.
COMMAND EXAMPLE
To list all of the root keys in the HKEY_LOCAL_MACHINE hive:
$ reg query HKEY_LOCAL_MACHINE
HKEY_LOCAL_MACHINE\BCD00000000
HKEY_LOCAL_MACHINE\HARDWARE
HKEY_LOCAL_MACHINE\SAM
HKEY_LOCAL_MACHINE\SECURITY
HKEY_LOCAL_MACHINE\SOFTWARE
HKEY_LOCAL_MACHINE\SYSTEM
wevtutil
Wevtutil is a command-line utility to view and manage system logs in the Windows environment. It is available in most modern versions of Windows and is callable from Git Bash.
el
Enumerate available logs
qe
Query a log’s events
/c
Specify the maximum number of events to read
/f
Format the output as text or XML
/rd
Read direction, if set to true it will read the most recent logs first
WARNING
In the Windows command prompt only a single / is needed before command
options. In the Git Bash terminal, two slashes (//) are needed (e.g., //c) due to the way commands are processed.
COMMAND EXAMPLE
To list all of the available logs:
wevtutil el
To view the most recent event in the System log using Git Bash:
wevtutil qe System //c:1 //rd:true
TIP
For additional information see Microsoft's documentation at https://docs.microsoft.com/en-us/windows-server/administration/windows-commands/wevtutil
The data you want may not always be available locally. You may need to connect to a remote
system such as a web, File Transfer Protocol (FTP), or Secure Shell (SSH) server to obtain the
desired data.
Commands can be executed remotely and securely using the Secure Shell (SSH) if the remote system is running the SSH service. In its basic form (no options), you can just add ssh and a hostname in front of any shell command to run that command on the specified host. For example, ssh myserver who will run the who command on the remote machine myserver. If you need to specify a different username, ssh username@myserver who or ssh -l username myserver who both do the same thing; just replace username with the username you would like to use to log in. You can redirect the output to a file on your local system, or to a file on the remote system.
To run a command on a remote system and redirect the output to a file on your local system:
ssh myserver ps > /tmp/ps.out
To run a command on a remote system and redirect the output to a file on the remote system:
ssh myserver ps \> /tmp/ps.out
The backslash will escape the special meaning of the redirect (in the current shell) and simply
pass the redirect character as the second word of the three words sent to myserver. When
executed on the remote system it will be interpreted by that shell and redirect the output on the
remote machine (myserver) and leave it there.
In addition you can take scripts that reside on your local system and run them on a remote
system using SSH. To run the osdetect.sh script remotely:
ssh myserver bash < ./osdetect.sh
This runs the bash command on the remote system, but passes into it the lines of the
osdetect.sh script directly from your local system. This avoids the need for a two-step process
of, first, transferring the script to the remote system and then running that copied script. Output
from running the script comes back to your local system and can be captured by redirecting
stdout as we have shown with many other commands.
Log files for a Linux system are normally stored in the /var/log/ directory. To easily collect
the log files into a single file use the tar command:
tar -czf ${HOSTNAME}_logs.tar.gz /var/log/
The option -c is used to create an archive file, -z to compress it with gzip, and -f to specify a name for
the output file. The HOSTNAME variable is a bash variable that is automatically set by the shell
to the name of the current host. We include it in our filename so the output file will be given the
same name as the system, which will help later with organization if logs are collected from
multiple systems. Note that you will need to be logged in as a privileged user or use sudo in
order to successfully copy the log files.
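If you combine this with the ssh technique shown earlier, you can pull the logs from a remote host in a single step. A sketch, assuming a reachable host named myserver and sufficient privileges on it:
ssh myserver tar -czf - /var/log/ > myserver_logs.tar.gz   # remote tar writes to stdout; we capture it locally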
Table 2-2 lists some important and common Linux logs and their standard locations.
Table 2-2. Linux Log Files
Log Location Description
/var/log/apache2/ Access and error logs for the Apache web server
/var/log/auth.log Information on user logins, privileged access, and remote authentication
/var/log/kern.log Kernel logs
/var/log/messages General noncritical system information
/var/log/syslog General system logs
To find more information on where log files are being stored for a given system refer to
/etc/syslog.conf or /etc/rsyslog.conf on most Linux distributions.
In the Windows environment wevtutil can be used to manipulate and gather log files.
Luckily this command is callable from Git Bash. The winlogs.sh script uses the wevtutil el command to list all available logs, and then the epl command to export each log to a file.
Example 2-2. winlogs.sh
#!/bin/bash
#
# winlogs.sh collect logs on an MSWin system
#
# 1. Create a new directory called <system name>_logs
# 2. use wevtutil el to get a list of all of the possible logs
# 3. For each log,
#    use wevtutil epl <log name> <current dir>\<system name>_<log name>.evtx
# 4. (optional) Tar and zip the files and name it <system name>_logs.tar.gz

TGZ=0
if (( $# > 0 ))
then
    if [[ ${1:0:2} == '-z' ]]
    then
        TGZ=1    # tgz flag to tar/zip the log files
        shift
    fi
fi

SYSNAM=$(hostname)
LOGDIR=${1:-/tmp/${SYSNAM}_logs}

mkdir -p $LOGDIR

wevtutil el | while read ALOG
do
    ALOG="${ALOG%$'\r'}"
    echo "${ALOG}:"
    wevtutil epl "$ALOG" "${LOGDIR}/${SYSNAM}_${ALOG// /_}.evtx"
done

if (( TGZ == 1 ))
then
    cd ${LOGDIR} && tar -czvf ${SYSNAM}_logs.tgz *.evtx
fi
The script begins with a simple initialization and then an if statement that checks to see if any arguments were provided to the script. The $# is a special shell variable whose value is the number of arguments supplied on the command line when this script is invoked. The conditional for the if is an arithmetic expression, because of the double parentheses. Therefore the comparison can use the greater-than character > and it will do a numerical comparison. If that symbol is used in an if expression with square brackets rather than double parentheses, the greater-than character > does a comparison of lexical (alphabetical) ordering. You would need to use -gt for a numerical comparison inside square brackets.
For this script the only argument we are supporting is a -z option to indicate that the log files should all be zipped up into a single tar file when it's done collecting log files. This also means that we can use a simplistic type of argument parsing. We will use a more sophisticated argument parser (getopts) in an upcoming script.
This check takes a substring of the first argument ($1) starting at the beginning of the string (an offset of zero bytes), two bytes long. If the argument is, in fact, a -z then we set a flag. The script also does a shift to remove that argument. What was the second argument, if any, is now the first. The third, if any, becomes the second, and so on.
If the user wants to specify a location for the logs it can be specified as an argument to the
script. The optional -z argument, if supplied, has already been shifted out of the way, so
any user supplied path would now be the first argument. If no value was supplied on the
command line then the expression inside the braces will return a default value as indicated to
the right of the minus sign. We use the braces around SYSNAM because the _logs would
otherwise be considered part of the variable name.
The -p option to mkdir will create the directory and any intervening directories. It will also not give an error message if the directory already exists.
Here we invoke wevtutil el to list all the possible log files. The output is piped into a
while loop which will read one line, that is, one log filename, at a time.
Since this is running on an MS-Windows system, each line printed by wevtutil will end with both a newline (\n) and a return (\r) character. We remove the return character from the right-hand side of the string using the % operator. To specify the (nonprinting) return character, we use the $'string' construct, which substitutes certain backslash-escaped characters with nonprinting characters (as defined in the ANSI C standard). So the two characters of \r are replaced with an ASCII 13 character, the return character.
We echo the filename to provide an indication to the user of progress being made and which
log is currently being fetched.
The fourth word on this line is the filename into which we want wevtutil to store the log file it is producing. Since the name of the log as provided may contain blanks, we replace any blank with an underscore character. While not strictly necessary, it avoids requiring quotes when using the filename. The syntax, in general, is ${VAR/old/new} to retrieve the value of VAR with a substitution, replacing old with new. Using a double slash, ${VAR//old/new}, replaces all occurrences, not just the first.
WARNING
A common mistake is to type ${VAR/old/new/} but the trailing slash is not
part of the syntax and will simply be added to the resulting string if a
substitution is made. For example, if VAR=embolden then
${VAR/old/new/} would return embnew/en.
This is another arithmetic expression, enclosed in double parentheses. Within those
expressions bash doesn’t require the $ in front of most variable names. It would still be
needed for positional parameters like $1 to avoid confusion with the integer 1.
Here we separate two commands with a double ampersand && which tells the shell to
execute the second command only if the first command succeeds. That way the tar doesn’t
happen unless the cd is successful.
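As a usage sketch, assuming the script is run from a Git Bash session and that /tmp/mylogs is a directory path of our own choosing:
$ bash winlogs.sh -z /tmp/mylogs
The logs are exported into /tmp/mylogs and, because of the -z option, bundled there into a single <system name>_logs.tgz file.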
If you are able to arbitrarily execute commands on a system you can use standard OS commands
to collect a variety of information about the system. The exact commands you use will vary
based on the operating system you are interfacing with. Table 2-3 shows common commands that can yield a great deal of information from a system. Note that the command may be different depending on whether it is run within the Linux or Windows environment.
Table 2-3. Local Data Gathering Commands
Linux Command | Windows Git Bash Equivalent | Purpose
cat /proc/cpuinfo | systeminfo | Display system hardware and related info
arp -a | arp -a | Display Address Resolution Protocol (ARP) table
The script getlocal.sh, below, is designed to identify the operating system type using
osdetect.sh, run the various commands appropriate for the operating system type, and record the
results to a file. The output from each command is stored in Extensible Markup Language
(XML) format, i.e., delimited with XML tags, for easier processing later on. Invoke the script
like this: bash getlocal.sh < cmds.txt where the file cmds.txt contains a list of
commands similar to those shown in Table 2-3. The format it expects is those fields, separated by vertical bars, plus an additional field: the XML tag with which to mark the output of the command. (Also, lines beginning with a # are considered comments and will be ignored.)
Here is what a cmds.txt file might look like:
# Linux Command    |MSWin Bash   |XML tag    |Purpose
#------------------+-------------+-----------+---------------------------------
uname -a           |uname -a     |uname      |O.S. version etc
cat /proc/cpuinfo  |systeminfo   |sysinfo    |system hardware and related info
ifconfig           |ipconfig     |nwinterface|Network interface information
route              |route print  |nwroute    |routing table
arp -a             |arp -a       |nwarp      |ARP table
netstat -a         |netstat -a   |netstat    |network connections
mount              |net share    |diskinfo   |mounted disks
ps -e              |tasklist     |processes  |running processes
Here is the source for the script.
Example 2-3. getlocal.sh
#!/bin/bash
#
# getlocal.sh
#
# 1. Create a text file named after the system name (uname -n)
# 2. Add a date and time to the file
# 3. Using our osdetect script to determine the OS type:
# 4. Run all of the commands in the table for the right OS
# 5. Append the output of each command to the file in XML format

# SepCmds - separate the commands from the line of input
# ALINE = "linux cmd | mswindows cmd | tag | comment"
function SepCmds()
{
    LCMD=${ALINE%%|*}
    REST=${ALINE#*|}
    WCMD=${REST%%|*}
    REST=${REST#*|}
    TAG=${REST%%|*}

    if [[ $OSTYPE == "MSWin" ]]
    then
        CMD="$WCMD"
    else
        CMD="$LCMD"
    fi
}

function DumpInfo ()
{
    printf '<systeminfo host="%s" type="%s"' "$HOSTNAME" "$OSTYPE"
    printf ' date="%s" time="%s">\n' "$(date '+%F')" "$(date '+%T')"

    readarray CMDS
    for ALINE in "${CMDS[@]}"
    do
        # ignore comments
        if [[ ${ALINE:0:1} == '#' ]] ; then continue ; fi

        SepCmds
        if [[ ${CMD:0:3} == N/A ]]
        then
            continue
        else
            printf "<%s>\n" $TAG
            $CMD
            printf "</%s>\n" $TAG
        fi
    done
    printf "</systeminfo>\n"
}

OSTYPE=$(./osdetect.sh)
HOSTNM=$(hostname)
TMPFILE="/tmp/${HOSTNM}.info"

# gather the info into the tmp file; errors, too
DumpInfo > $TMPFILE 2>&1
After the two function definitions the script begins here, invoking our osdetect.sh script
(from a previous chapter). We’ve specified the current directory as its location. You could
put it elsewhere but then be sure to change the specified path from ./ to wherever you put it
and/or add that location to your PATH variable.
NOTE
To make things more efficient you can include the code from osdetect.sh
directly in getlocal.sh.
Next we run the hostname program in a subshell to retrieve the name of this system for
use in the next line but also later in the DumpInfo function.
We use the hostname for the temporary filename where we will put all our output.
Here is where we invoke the function that will do most of the work of this script. We
redirect both stdout and stderr (to the same file) when invoking the function so that the
function doesn’t have to put redirects on any of its output statements; it can write to stdout
and this invocation will redirect all the output as needed.
Here is where the "guts" of the script begins. This function begins with some output of an XML tag called <systeminfo>, which will have its closing tag written out at the end of this function.
The readarray command in bash will read all the lines of input (until end-of-file or, on keyboard input, until Ctrl-D). Each line will be its own entry in the array named, in this case, CMDS.
This for loop will loop over the values of the CMDS array, that is, over each line, one at a
time.
This line uses the substring operation to take the character at position 0, of length 1, from the
variable ALINE. The hashtag (or pound sign) is in quotes so that the shell doesn’t interpret it
as the start of the script’s own comment.
If the line is not a comment, the script will call the SepCmds function. More about that
function later; it separates the line of input into CMD and TAG, where CMD will be the appropriate command for a Linux or MS-Windows system, depending on where we run the script.
Here again we use the substring operation from the start of the string (position 0), of length 3, to look for the string that indicates that there is no appropriate operation on this particular operating system for the desired information. The continue statement tells bash to skip to the next iteration of the loop.
If we do have an appropriate action to take, this section of code will print the specified XML
tag on either side of the invocation of the specified command. Notice that we just invoke the
command by retrieving the value of the variable CMD.
Here we isolate the Linux command from a line of our input file by removing all the
characters to the right of the vertical bar, including the bar itself. The %% says to make the
longest match possible on the right side of the variable’s value and remove it from the value
it returns (i.e., ALINE isn’t changed).
Here the # removes the shortest match from the left-hand side of the variable's value. Thus, it removes the Linux command that was just put in LCMD.
Again we remove everything to the right of the vertical bar but this time we are working
with REST, modified in the previous statement. This gives us the MSWindows command.
Here we extract the XML tag using the same substitution operations we’ve seen twice
already.
All that’s left in this function is the decision, based on the operating system type, as to which
value to return as the value in CMD. All variables are “global” unless explicitly declared as local
within a function. None of ours are local, so they can be used (set, changed, or used) throughout
the script.
When running this script you can use the cmds.txt file as shown or change its values to get
whatever set of information you want to collect. You can also run it without redirecting the
input from a file; simply type (or copy/paste) the input once the script is invoked.
The Windows Registry is a vast repository of settings that define how the system and
applications will behave. Specific registry key values can often be used to identify the presence
of malware and other intrusions. Because of that a copy of the registry is useful when later
performing analysis of the system.
To export the entire Windows Registry to a file:
regedit //E ${HOSTNAME}_reg.bak
Note that two forward slashes are used before the E option because we are calling regedit from Git Bash; only one would be needed if using the Windows command prompt. We use ${HOSTNAME} as part of the output filename to make it easier to organize later on.
If needed, the reg command can also be used to export sections of the registry or individual
subkeys. To export the HKEY_LOCAL_MACHINE hive:
reg export HKEY_LOCAL_MACHINE $(uname -n)_hklm.bak
Searching the File System
The ability to search the system is critical for everything from organizing files, to incident
response, to forensic investigation. The find and grep commands are extremely powerful and
can be used to perform a variety of search functions.
Searching by Filename
Searching by filename is one of the most basic search methods. This is useful if the exact
filename is known, or a portion of the filename is known. To search the /home directory and
subdirectories for filenames containing the word password:
find /home -name '*password*'
Note that the * character at the beginning and end of the search string designates a wildcard, meaning it will match any (or no) characters. This is a shell pattern and is not the same as a regular expression. Additionally, you can use the -iname option instead of -name to make the search case-insensitive.
TIP
If you want to suppress errors, such as Permission Denied, when using find you
can do so by redirecting stderr to /dev/null or to a log file.
find /home -name '*password*' 2>/dev/null
Hidden files are often interesting as they can be used by people or malware looking to avoid
detection. In Linux, names of hidden files begin with a period. To find hidden files in the
/home directory and subdirectories:
find /home -name '.*'
TIP
The .* in the example above is a shell pattern which is not the same as a regular
expression. In the context of find the pattern provided will match on any file that
begins with a period and is followed by any number of additional characters
(denoted by the * wildcard character).
In Windows, hidden files are designated by a file attribute, not the filename. From the Windows
command prompt you can identify hidden files on the c:\ drive by:
dir c:\ /S /A:H
The /S option tells dir to recursively traverse subdirectories, and /A:H displays files with the hidden attribute. Unfortunately Git Bash intercepts the dir command and instead executes ls, which means it cannot easily be run from bash. This can be solved by using the find command's -exec option coupled with the Windows attrib command.
The find command has the ability to run a specified command for each file that is found. To do that you can use the -exec option after specifying your search criteria. The -exec option replaces any curly brackets ({}) with the pathname of the file that was found. The semicolon terminates the command expression.
$ find /c -exec attrib '{}' \; | egrep '^.{4}H.*'
A H C:\Users\Bob\scripts\hist.txt
A HR C:\Users\Bob\scripts\winlogs.sh
The find command will execute the Windows attrib command for each file it identifies on
the c:\ drive (denoted as /c), thereby printing out each file’s attributes. The egrep command
is then used with a regular expression to identify lines where the 5th character is the letter H,
which will be true if the file’s hidden attribute is set.
If you want to clean up the output further and only display the file path you can do so by piping
the output of egrep into the cut command.
$ find . -exec attrib '{}' \; | egrep '^.{4}H.*' | cut -c22-
C:\Users\Bob\scripts\hist.txt
C:\Users\Bob\scripts\winlogs.sh
The -c option tells cut to use character position numbers for slicing. 22- tells cut to begin at character 22, which is the beginning of the file path, and continue to the end of the line (the trailing -).
This can be useful if you want to pipe the file path into another command for further processing.
The find command’s size option can be used to find files based on file size. This can be
useful to help identify unusually large files, or to identify the largest or smallest files on a
system.
To search for files greater than 5 GB in size in the /home directory and subdirectories:
find /home -size +5G
To identify the largest files in the system you can combine find with a few other commands:
find / -type f -exec ls -s '{}' \; | sort -n -r | head -5
First we use find / -type f to list all of the files in and under the root directory. Each file is passed to ls -s, which will identify its size in blocks (not bytes). The list is then sorted from highest to lowest, and the top five are displayed using head. To see the smallest files in the system, tail can be used in place of head, or you can remove the reverse (-r) option from sort.
You can also use the ls command directly to find the largest files and completely eliminate the use of find, which is significantly more efficient. To do that, just add the -R option to ls, which will cause it to recursively list the files under the specified directory.
ls / -R -s | sort -n -r | head -5
Searching by Time
The file system can also be searched based on when files were last accessed or modified. This
can be useful when investigating incidents to identify recent system activity. It can also be
useful for malware analysis to identify files that have been accessed or modified during program
execution.
To search for files in the /home directory and subdirectories modified less than 5 minutes ago:
find /home -mmin -5
To search for files modified less than 24 hours ago:
find /home -mtime -1
The number specified with the -mtime option is a multiple of 24 hours, so 1 means 24 hours, 2 means 48 hours, etc. A negative number here means "less than" the number specified, a positive number means "greater than," and an unsigned number means "exactly."
To search for files modified more than 2 days, i.e., 48 hours, ago:
find /home -mtime +2
To search for files accessed less than 24 hours ago, use the -atime option:
find /home -atime -1
To search for files in the /home directory accessed less than 24 hours ago and copy (cp) each
file to the current working directory (./):
find /home -type f -atime -1 -exec cp '{}' ./ \;
The use of -type f tells find to match only ordinary files, ignoring directories and other special file types. You may also copy the files to any directory of your choosing by replacing the ./ with an absolute or relative path.
WARNING
Be sure that your current working directory is not somewhere in the /home hierarchy or you will have the copies found and thus copied again.
Searching by Content
To search the contents of files for a given pattern, use grep. For example, to search files in the /home directory and subdirectories for any containing the word password:
grep -r -i /home -e 'password'
The -r option recursively searches all directories below /home, -i specifies a case-insensitive search, and -e specifies the regex pattern string to search for.
TIP
The -n option can be used to identify on which line in the file the search string is found, and -w can be used to match only whole words.
You can combine grep with find to easily copy matching files to your current working
directory (or any specified directory):
find /home -type f -exec grep '{}' -e 'password' \; -exec cp '{}' ./ \;
First we use find /home -type f to identify all of the files in and below the /home directory. Each file found is passed to grep to search for password within its content. Each file matching the grep criteria is then passed to the cp command to copy the file to the current working directory (./). This combination of commands may take a considerable amount of time to execute and is a good candidate to run as a background task.
Searching a system for specific file types can be challenging. You cannot rely on the file
extension, if one even exists, as that can be manipulated by the user. Thankfully the file
command can help identify types by comparing the contents of a file to known patterns called magic numbers. Table 2-4 lists common magic numbers and their starting location (offset) inside files.
Table 2-4. Magic Numbers
File Type | Hex | ASCII | Offset
DOS Executable | 4D 5A | MZ | 0
Executable and Linkable Format (ELF) | 7F 45 4C 46 | .ELF | 0
To begin, you need to identify the type of file for which you want to search. Let's assume you
want to find all of the PNG image files on the system. First you would take a known good file
such as Title.png, run it through the file command, and examine the output.
$ file Title.png
Title.png: PNG image data, 366 x 84, 8-bit/color RGBA, non-interlaced
As expected file identifies the known good Title.png file as PNG image data and also
provides the dimensions and various other attributes. Based on this information you need to
determine what part of the file command output to use for the search, and generate the
appropriate regular expression. In many cases, such as with forensic discovery, you are likely
better off gathering more information than less; you can always further filter the data later. To
do that you will use a very broad regular expression that will simply search for the word PNG in
the output from the file command.
.*PNG.*
You can of course make more advanced regular expressions to identify specific files. For
example, if you wanted to find PNG files that have a dimension of 100 x 100:
.*PNG.* 100 x 100.*
If you want to find PNG and JPEG files:
.*(PNG|JPEG).*
Once you have the regular expression you can write a script to run the file command against
every file on the system looking for a match. When a match is found typesearch.sh will
print the file path to standard out.
Example 2-4. typesearch.sh
# A bash script that will search the file system for a given file type.
# It prints out the pathname (i.e., file name and location) when found.
# The file type is defined by the "file" command; specify a reg. exp. for
# a substring of file's output to define what you're looking for.
# Use the -i option to ignore case (making your reg. exp. easier)
# Use the -R option (or -r) for recursion, otherwise it doesn't descend
# into subdirectories.
# There is also an option (-c) to copy the found files
# to a specified location (i.e., directory)
# usage:
#   typesearch.sh [-c dir] [-i] [-R|-r] <pattern> [ starting/path ]
# e.g., typesearch.sh -Ri png
#   will look recursively for all files of type PNG (or png or Png or ...)

DEEPORNOT="-maxdepth 1"    # just the current dir; default

# PARSE option arguments:
while getopts 'c:irR' opt; do
    case "${opt}" in
        c)  # copy found files to specified directory
            COPY=YES
            DESTDIR="$OPTARG"
            ;;
        i)  # ignore u/l case differences in search
            CASEMATCH='-i'
            ;;
        [Rr])   # recursive
            unset DEEPORNOT;;
        *)  # unknown/unsupported option
            # error mesg will come from getopts, so just exit
            exit 2 ;;
    esac
done
shift $((OPTIND - 1))

PATTERN=${1:-PDF document}
STARTDIR=${2:-.}    # by default start here

find $STARTDIR $DEEPORNOT -type f | while read FN
do
    file $FN | grep -q $CASEMATCH "$PATTERN"
    if (( $? == 0 ))    # found one
    then
        echo $FN
        if [[ $COPY ]]
        then
            cp -p $FN $DESTDIR
        fi
    fi
done
This script supports options which alter its behavior, as described in the opening comments
of the script. The script needs to parse these options to tell which ones have been provided
and which are omitted. For anything more than a single option or two it makes sense to use
the getopts shell builtin. With the while loop we will keep calling getopts until it returns a nonzero value, telling us that there are no more options. The options we want to look for are provided in the string 'c:irR'. Whichever option is found is returned in opt, the variable name we supplied.
We are using a case statement here which is a multiway branch; it will take the branch that
matches the pattern provided before the left parenthesis. We could have used an if/elif/else
construct but this reads well and makes the options so clearly visible.
The -c option has a : after it in the list of supported options, which indicates to getopts that the user will also supply an argument for that option. For this script that optional
argument is the directory into which copies will be made. When getopts parses an option
with an argument like this it puts the argument in the variable named OPTARG and we save
it in DESTDIR because another call to getopts may change OPTARG.
The script supports either an uppercase R or a lowercase r for this option. Case statements specify a pattern to be matched, not just a simple literal, so we wrote [Rr]) for this case, using the brackets construct to indicate that either letter is considered a match.
The other options set variables to cause their action to occur. In this case we unset the
previously set variable. When that variable is referenced later as $DEEPORNOT it will have
no value so it will effectively disappear from the command line where it is used.
Here is another pattern, the asterisk, which matches anything. If no other pattern has been
matched, this case will be executed. It is, in effect, an “else” clause for the case statement.
When we’re done parsing the options we can get rid of the ones we’ve already processed
with a shift. A single shift gets rid of a single argument, so that the second argument becomes the first, the third becomes the second, and so on. Specifying a number like shift 5 will get rid of the first 5 arguments so that $6 becomes $1, $7 becomes $2, and so on. Calls to getopts keep track of which arguments to process in the shell variable OPTIND. It refers to the next argument to be processed. By shifting by this amount we get rid of any/all of the options that we parsed. After this shift, $1 will refer to the first non-option argument, whether or not any options were supplied when the user invoked the script.
The two possible arguments that aren't in option format are the pattern we're searching for and the directory where we want to start our search. When we refer to a bash variable we can add a :- to say "if that value is empty or unset, then return this default value instead." We give a default value for PATTERN of PDF document, and the default for STARTDIR is . which refers to the current directory.
We invoke the find command, telling it to start its search in $STARTDIR. Remember that $DEEPORNOT may be unset and thus add nothing to the command line, or it may be the default -maxdepth 1, telling find not to go any deeper than this directory. We've added a -type f so that we only find plain files (not directories or special device files or FIFOs). That isn't strictly necessary and you could remove it if you want to be able to search for those kinds of files. The names of the files found are piped into the while loop, which will read them one at a time into the variable FN.
The -q option to grep tells it to be quiet and not output anything. We don't need to see what phrase it found, only that it found it.
The $? construct is the value returned by the previous command. A successful result means
that grep found the pattern supplied.
This checks to see if COPY has a value. If it is null the if will be false.
The -p option to the cp command will preserve the mode, ownership, and timestamps of the file, in case that information is important to your analysis.
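As a usage sketch that exercises the copy option (the destination directory /tmp/found is our own and must already exist):
$ bash typesearch.sh -R -i -c /tmp/found 'PNG image' /home
Each matching pathname is printed, and the file is copied into /tmp/found with its mode, ownership, and timestamps preserved.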
If you are looking for a lighter-weight but less capable solution, you can perform a similar search using the find command's -exec option, as seen in the example below.
find / -type f -exec file '{}' \; | egrep '.*PNG.*' | cut -d' ' -f1
Here we send each item found by the find command into file to identify its type. We then
pipe the output of file into egrep and filter it looking for the PNG keyword. The use of cut
is simply to clean up the output and make it more readable.
WARNING
Be cautious if using the file command on an untrusted system. The file
command uses the magic pattern file located at /usr/share/misc/. A
malicious user could modify this file such that certain file types would not be
identified. A better option is to mount the suspect drive to a known-good system
and search from there.
A cryptographic hash function is a one-way function that transforms an input message of arbitrary length into a fixed-length message digest. Common hash algorithms include MD5, SHA-1, and SHA-256. Take the following two files:
Example 2-5. hashfilea.txt
This is hash file A
Example 2-6. hashfileb.txt
This is hash file B
Notice that the files are identical except for the last letter in the sentence. You can use the
sha1sum command to compute the SHA-1 message digest of each file.
$ sha1sum hashfilea.txt hashfileb.txt
6a07fe595f9b5b717ed7daf97b360ab231e7bbe8 *hashfilea.txt
2959e3362166c89b38d900661f5265226331782b *hashfileb.txt
Even though there was only a small difference between the two files they generated completely
different message digests. Had the files been the same the message digests would have also been
the same. You can use this property of hashing to search the system for a specific file if you know its digest. The advantage is that the search will not be influenced by the filename, location, or any other attributes; the disadvantage is that the file needs to be exactly the same. If the file contents have changed in any way, the search will fail.
Example 2-7. hashsearch.sh
#
# This script will recursively search a given directory
# for a file that matches a given SHA-1 hash.
# Output the full path of any file that matches.
#
# $0 hashtomatch [dir]
# $0 131341324134 .
# since default is '.', that's the same as:
# $0 131341324134
#

HASH=$1
DIR=${2:-.}    # default is here, cwd

# convert pathname into an absolute path
function mkabspath ()
{
    if [[ $1 == /* ]]
    then
        ABS=$1
    else
        ABS="$PWD/$1"
    fi
}

find $DIR -type f |
while read fn
do
    THISONE=$(sha1sum "$fn")
    THISONE=${THISONE%% *}

    if [[ $THISONE == $HASH ]]
    then
        mkabspath "$fn"
        echo $ABS
    fi
done
We'll look for any plain file for our hash. We need to avoid special files; reading a FIFO would cause our program to hang as it waited for someone to write into the FIFO. Reading a block special or character special file would also not be a good idea. The -type f assures that we only get plain files. It prints those filenames, one per line, to stdout, which we redirect via a pipe into the while read commands.
This computes the hash value in a subshell and captures its output (i.e., whatever it writes to
stdout) and assigns it to the variable. The quotes are needed in case the filename has spaces
in its name.
This reassignment removes from the right hand side the largest substring beginning with a
space. The output from sha1sum is both the computed hash and the filename. We only
want the hash value, so we remove the filename with this substitution.
We call the mkabspath function putting the filename in quotes. The quotes make sure that
the entire filename shows up as a single argument to the function, even if the filename has
one or more spaces in the name.
Remember that shell variables are global unless declared to be local within a function.
Therefore the value of ABS that was set in the call to mkabspath is available to us here.
This is our declaration of the function. When declaring a function, you can omit either the
keyword function or the parentheses, but not both.
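For example, both of the following are valid ways to declare the function (the one-line bodies are shortened just to illustrate the syntax); omitting both the keyword and the parentheses would be an error:
function mkabspath { ABS="$PWD/$1"; }   # keyword present, parentheses omitted
mkabspath () { ABS="$PWD/$1"; }         # parentheses present, keyword omitted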
For the comparison we are using shell pattern matching on the right hand side. This will
check to see if the first parameter begins with a slash. If it does, then this is already an
absolute pathname and we need do nothing further.
When the parameter is only a relative path, it is relative to the current location, so we prepend
the current working directory, thereby making it absolute. The variable PWD is a shell
variable that is set to the current directory via the cd command.
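Putting it all together, a typical invocation of the script might look like the following (the hash value and the search directory are illustrative); it prints the absolute path of every matching file it finds:
$ bash hashsearch.sh 6a07fe595f9b5b717ed7daf97b360ab231e7bbe8 /home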
Transferring Data
Once you have gathered all of the desired data, the next step is to move it off of the origin
system for further analysis. To do that you can copy the data to a removable device or upload it
to a centralized server. If you are going to upload the data be sure to do so using a secure
method such as Secure Copy (SCP). The example below uses scp to upload the file
some_system.tar.gz to the home directory of user bob on remote system 10.0.0.45.
scp some_system.tar.gz [email protected]:/home/bob/some_system.tar.gz
For convenience you can add a line at the end of your collection scripts to automatically use
scp to upload data to a specified host. Remember to give your files unique names so as not to
overwrite existing files and to make analysis easier later on.
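For example, a collection script might end with a line like the following, reusing the user and server from the example above; the archive name embeds the hostname and date to keep it unique:
scp ${HOSTNAME}_$(date +%F).tar.gz [email protected]:/home/bob/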
WARNING
Be cautious of how you perform SSH or SCP authentication within scripts. It is not
recommended that you include passwords in your scripts. The preferred method is
to use SSH certificates. The keys and certificates can be generated using the ssh-keygen
command.
Summary
Gathering data is an important step in defensive security operations. When collecting data be
sure to transfer and store it using secure methods (i.e. encrypted). As a general rule, gather all
data that you think is relevant; you can easily delete data later, but you cannot analyze data you
did not collect. Before collecting data, first confirm you have permission and/or legal authority
to do so.
Also be aware that when dealing with adversaries, they will often try to hide their presence by
deleting or obfuscating data. To counter that, be sure to use multiple methods when searching for
files (name, hash, contents, etc.).
In the next chapter we will explore techniques for processing data and preparing it for analysis.
Practice
After completing the exercises below you will be able to:
Use command prompt utilities to gather important system information
Modify and enhance bash scripts
Transfer information using secure methods
Exercises
1. Write the command to search the file system for any file named dog.png.
2. Write the command to search the file system for any file containing the text
confidential.
3. Write the command to search the file system for any file containing the text secret or
confidential and copy the file to your current working directory.
4. Write the command to execute ls -R / on the remote system 192.168.10.32 and
write the output to a file named filelist.txt on your local system.
5. Modify getlocal.sh to automatically upload the results to a specified server using
SCP.
6. Modify hashsearch.sh to have an option (-1) to quit after finding a match. If the
option is not specified, it will keep searching for additional matches.
7. Modify hashsearch.sh to simplify the full pathname that it prints out.
a. If the string it output was /home/usr07/subdir/./misc/x.data modify it
to remove the redundant ./ before printing it out.
b. If the string was /home/usr/07/subdir/../misc/x.data modify it to
remove the ../ and also the subdir/ before printing it out.
8. Modify winlogs.sh to indicate its progress by printing the logfile name over the top of
the previous logfile name. (Hint: use a return character rather than a newline)
9. Modify winlogs.sh to show a simple progress bar of plus signs building from left to
right. Use a separate invocation of wevtutil el to get the count of the number of logs
and scale this to, say, a width of 60.
Chapter 3. Data Processing
A NOTE TO EARLY RELEASE READERS
This will be the sixth chapter in the final book.
In the previous chapter you gathered lots of data. Likely that data is in a variety of formats,
including freeform text, comma-separated values (CSV), and Extensible Markup Language
(XML). In this chapter we show you how to parse and manipulate that data so you can extract
key elements for analysis.
Commands in Use
We introduce awk, join, sed, tail, and tr to prepare data for analysis.
awk
Awk is not just a command, but actually a programming language designed for processing text.
There are entire books dedicated to this subject. Awk will be explained in more detail
throughout this book, but here we provide just a brief example of its usage.
-f
Read in the awk program from a specified file
COMMAND EXAMPLE
Take the file awkusers.txt:
Example 3-1. awkusers.txt
Mike Jones
John Smith
Kathy Jones
Jane Kennedy
Tim Scott
You can use awk to print each line where the user’s last name is Jones.
$ awk '$2 == "Jones" {print $0}' awkusers.txt
Mike Jones
Kathy Jones
Awk will iterate through each line of the input file reading in each word (separated by
whitespace by default) into fields. Field $0 represents the entire line, $1 the first word, $2 the
second word, etc. An awk program consists of patterns and corresponding code to be executed
when that pattern is matched. In this example there is only one pattern. We test $2 to see if that
field is equal to Jones. If it is, awk will run the code in the braces which, in this case, will
print the entire line.
NOTE
If we left off the explicit comparison and instead wrote awk ' /Jones/
{print $0}' then the string inside the slashes is a regular expression to match
anywhere in the input line. It would print all the names as before, but it would also
find lines where Jones might be the first name or part of a longer name (such as
“Jonestown”).
join
Join combines the lines of two files that share a common field. In order for join to function
properly the input files must be sorted.
-j
Join using the specified field number. Fields start at 1.
-t
Specify the character to use as the field separator. Space is the default field separator.
--header
Use the first line of each file as a header.
COMMAND EXAMPLE
Take the following files:
Example 3-2. usernames.txt
1,jdoe
2,puser
3,jsmith
Example 3-3. accesstime.txt
0745,file1.txt,1
0830,file4.txt,2
0830,file5.txt,3
Both files share a common field of data, which is the user ID. In accesstime.txt the user ID is in
the third column. In usernames.txt the user ID is in the first column. You can merge these two
files using join as follows:
$ join -1 3 -2 1 -t, accesstime.txt usernames.txt
1,0745,file1.txt,jdoe
2,0830,file4.txt,puser
3,0830,file5.txt,jsmith
The -1 3 option tells join to use the third column in the first file (accesstime.txt), and -2 1
specifies the first column in the second file (usernames.txt) for use when merging the files. The
-t, option specifies the comma character as the field delimiter.
sed
Sed allows you to perform edits, such as replacing characters, on a stream of data.
-i
Edit the specified file and overwrite it in place
COMMAND EXAMPLE
The sed command is quite powerful and can be used for a variety of functions; however,
replacing characters or sequences of characters is one of the most common. Take the file ips.txt:
Example 3-4. ips.txt
ip,OS
10.0.4.2,Windows 8
10.0.4.35,Ubuntu 16
10.0.4.107,macOS
10.0.4.145,macOS
You can use sed to replace all of the instances of the 10.0.4.35 IP address with
10.0.4.27.
$ sed 's/10\.0\.4\.35/10.0.4.27/g' ips.txt
ip,OS
10.0.4.2,Windows 8
10.0.4.27,Ubuntu 16
10.0.4.107,macOS
10.0.4.145,macOS
In this example, sed uses the following format with each component separated by a forward
slash:
s/<regular expression>/<replace with>/<flags>
The first part of the command (s) tells sed to substitute. The second part of the command
(10\.0\.4\.35) is a regular expression pattern. The third part (10.0.4.27) is the value to
use to replace the regex pattern matches. The fourth part is optional flags, which in this case (g,
for global) tells sed to replace all instances on a line (not just the first) that match the regex
pattern.
tail
The tail command is used to output the last lines of a file. By default tail will output the
last 10 lines of a file.
-f
Continuously monitor the file and output lines as they are added
-n
Output the specified number of lines
COMMAND EXAMPLE
To output the last line in the somefile.txt file:
$ tail -n 1 somefile.txt
12/30/2017 192.168.10.185 login.html
tr
The tr command is used to translate or map from one character to another. It is also often used
to delete unwanted or extraneous characters. It only reads from stdin and writes to stdout so you
typically see it with redirects for the input and output files.
-d
Delete the specified characters from the input stream
-s
Squeeze; that is, replace repeated instances of a character with a single instance
COMMAND EXAMPLE
You can translate all the backslashes into forward slashes and all the colons to vertical bars with
the tr command:
tr '\\:' '/|' < infile.txt > outfile.txt
If the contents of infile.txt looked like this:
drive:path\name
c:\Users\Default\file.txt
then after running the tr command, outfile.txt would contain this:
drive|path/name
c|/Users/Default/file.txt
The characters from the first argument are mapped to the corresponding characters in the second
argument. Two backslashes are needed to specify a single backslash character because the
backslash has a special meaning to tr; it is used to indicate special characters like newline \n
or return \r or tab \t. You use the single quotes around the arguments to avoid any special
interpretation by bash.
TIP
Files from Windows systems often come with both a Carriage Return and a Line
Feed (CR & LF) character at the end of each line. Linux and macOS systems will
have only the newline character to end a line. If you transfer a file to Linux and
want to get rid of those extra return characters, here is how you might do that with
the tr command:
tr -d '\r' < fileWind.txt > fileFixed.txt
Conversely, you can convert Linux line endings to Windows line endings using
sed:
$ sed -i 's/$/\r/' fileLinux.txt
The -i option makes the changes in place and writes them back to the input file.
"name","username","phone","password hash"
"John Smith","jsmith","5555551212",5f4dcc3b5aa765d61d8327deb882cf99
"Jane Smith","jnsmith","5555551234",e10adc3949ba59abbe56e057f20f883e
"Bill Jones","bjones","5555556789",d8578edf8458ce06fbc5bb76a58c5ca4
To extract just the name from the file, you can use cut by specifying the field delimiter as a
comma and the field number you would like returned.
$ cut -d',' -f1 csvex.txt
"name"
"John Smith"
"Jane Smith"
"Bill Jones"
Note that the field values are still enclosed in double quotation marks. This may not be desirable for
certain applications. To remove the quotation marks you can simply pipe the output into tr with its
-d option.
$ cut -d',' -f1 csvex.txt | tr -d '"'
name
John Smith
Jane Smith
Bill Jones
You can further process the data by removing the field header using the tail command's -n
option.
$ cut -d',' -f1 csvex.txt | tr -d '"' | tail -n +2
John Smith
Jane Smith
Bill Jones
The -n +2 option tells tail to output the contents of the file starting at line number 2, thus
removing the field header.
TIP
You can also give cut a list of fields to extract, such as -f1-3 to extract fields 1
through 3, or a list such as -f1,4 to extract fields 1 and 4.
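For instance, extracting fields 1 and 4 from csvex.txt returns the name and password hash columns together:
$ cut -d',' -f1,4 csvex.txt
"name","password hash"
"John Smith",5f4dcc3b5aa765d61d8327deb882cf99
"Jane Smith",e10adc3949ba59abbe56e057f20f883e
"Bill Jones",d8578edf8458ce06fbc5bb76a58c5ca4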
While you can use cut to extract entire columns of data, there are instances where you will
want to process the file and extract fields line by line; in this case you are better off using awk.
Let’s suppose you want to check each user’s password hash in csvex.txt against the dictionary
file of known passwords passwords.txt.
Example 3-6. csvex.txt
"name","username","phone","password hash"
"John Smith","jsmith","5555551212",5f4dcc3b5aa765d61d8327deb882cf99
"Jane Smith","jnsmith","5555551234",e10adc3949ba59abbe56e057f20f883e
"Bill Jones","bjones","5555556789",d8578edf8458ce06fbc5bb76a58c5ca4
Example 3-7. passwords.txt
password,md5hash
123456,e10adc3949ba59abbe56e057f20f883e
password,5f4dcc3b5aa765d61d8327deb882cf99
welcome,40be4e59b9a2a2b5dffb918c0e86b3d7
ninja,3899dcbab79f92af727c2190bbd8abc5
abc123,e99a18c428cb38d5f260853678922e03
123456789,25f9e794323b453885f5181f1b624d0b
12345678,25d55ad283aa400af464c76d713c07ad
sunshine,0571749e2ac330a7455809c6b0e7af90
princess,8afa847f50a716e64932d995c8e7435a
qwerty,d8578edf8458ce06fbc5bb76a58c5ca4
You can extract each user’s hash from csvex.txt using awk as follows:
$ awk F "," '{print $4}' csvex.txt
"password hash"
5f4dcc3b5aa765d61d8327deb882cf99
e10adc3949ba59abbe56e057f20f883e
d8578edf8458ce06fbc5bb76a58c5ca4
By default awk uses the space character as a field delimiter, so the -F option is used to identify
a custom field delimiter (,) and then print out the fourth field ($4), which is the password hash.
You can then use grep to take the output from awk one line at a time and search for it in the
passwords.txt dictionary file, outputting any matches.
$ grep "$(awk F "," '{print $4}' csvex.txt)" passwords.txt
123456,e10adc3949ba59abbe56e057f20f883e
password,5f4dcc3b5aa765d61d8327deb882cf99
qwerty,d8578edf8458ce06fbc5bb76a58c5ca4
If a file has fixed-width field sizes you can use the cut command's -c option to extract data by
character position. In csvex.txt the (U.S. 10-digit) phone number is an example of a fixed-width
field.
$ cut -d',' -f3 csvex.txt | cut -c2-13 | tail -n +2
555-555-1212
555-555-1234
555-555-6789
Here you first use cut in delimited mode to extract the phone number at field 3. Since each
phone number is the same number of characters, you can use the cut character position option
(-c) to extract the characters in between the quotation marks. Finally, tail is used to remove the
file header.
Processing XML
Extensible Markup Language (XML) allows you to arbitrarily create tags and elements that
describe data. Below is an example XML document.
Example 3-8. book.xml
<book title="Rapid Cybersecurity Ops" edition="1">
<author>
<firstName>Paul</firstName>
<lastName>Troncone</lastName>
</author>
<author>
<firstName>Carl</firstName>
<lastName>Albing</lastName>
</author>
</book>
<1> This is a start tag that contains two attributes, also known as name/value pairs. Attribute
values must always be quoted.
<2> This is a start tag.
<3> This is an element that has content.
<4> This is an end tag.
For useful processing, you must be able to search through the XML and extract data from within
the tags, which can be done using grep. Let's find all of the firstName elements. The -o
option is used so only the text that matches the regex pattern will be returned, rather than the
entire line.
$ grep -o '<firstName>.*<\/firstName>' book.xml
<firstName>Paul</firstName>
<firstName>Carl</firstName>
Note that the regex pattern above will only find the XML element if the start and end tags are on
the same line. To find the pattern across multiple lines you need to make use of two special
features. First, add the -z option to grep, which treats newlines like any ordinary character in
its searching and adds a null (ASCII 0) at the end of each string it finds. Then add the -P option
and (?s) to the regex pattern, which is a Perl-specific pattern match modifier. It modifies the .
metacharacter to also match on the newline character.
$ grep -Pzo '(?s)<author>.*?<\/author>' book.xml
<author>
<firstName>Paul</firstName>
<lastName>Troncone</lastName>
</author><author>
<firstName>Carl</firstName>
<lastName>Albing</lastName>
</author>
WARNING
The -P option is not available in all versions of grep, including those included
with macOS.
To strip the XML start and end tags and extract the content you can pipe your output into sed.
$ grep -Po '<firstName>.*?<\/firstName>' book.xml | sed 's/<[^>]*>//g'
Paul
Carl
The sed expression can be described as s/expr/other/ to replace (or substitute) some
expression (expr) with something else (other). The expression can be just literal characters
or a more complex regex. If an expression has no “other” portion, such as s/expr// then it
replaces anything that matches the regular expression with nothing, essentially removing it. The
regex pattern we use in the above example, namely the <[^>]*> expression, is a little
confusing, so let's break it down.
< The pattern begins with a literal less-than character (<).
[^>]* Zero or more (indicated by the asterisk) characters from the set of characters inside the
brackets; the first character is a ^, which means "not" any of the remaining characters listed.
Here that's just the solitary greater-than character, so [^>] matches any character that is not >.
> The pattern ends with a literal >.
This should match a single XML tag, from its opening less-than to its closing greater-than
character, but not more than that.
Processing JSON
JavaScript Object Notation (JSON) is another popular file format, particularly for exchanging
data through Application Programming Interfaces (APIs). JSON is a simple format that consists
of objects, arrays, and name/value pairs. Here is a sample JSON file:
Example 3-9. book.json
{
"title": "Rapid Cybersecurity Ops",
"edition": 1,
"authors": [
{
"firstName": "Paul",
"lastName": "Troncone"
},
{
"firstName": "Carl",
"lastName": "Albing"
}
]
}
<1> This is an object. Objects begin with { and end with }.
<2> This is a name/value pair. Values can be a string, number, array, boolean, or null.
<3> This is an array. Arrays begin with [ and end with ].
TIP
For more information on the JSON format visit http://json.org/
When processing JSON you are likely going to want to extract key/value pairs. To do that you
can use grep. Let's extract the firstName key/value pair from book.json.
$ grep -o '"firstName": ".*"' book.json
"firstName": "Paul"
"firstName": "Carl"
Again, the -o option is used to return only the characters that match the pattern rather than the
entire line of the file.
If you want to remove the key and only display the value you can do so by piping the output
into cut, extracting the second field, and removing the quotations with tr.
$ grep o '"firstName": ".*"' book.json | cut d " " f2 | tr d '\"'
Paul
Carl
We will perform more advanced processing of JSON in Chapter 11.
JQ
jq is a lightweight language and JSON parser for the Linux command line. It is very
powerful, but it is not installed by default on most versions of Linux.
To get the title key in book.json using jq:
$ jq '.title' book.json
"Rapid Cybersecurity Ops"
To list the first name of all of the authors:
$ jq '.authors[].firstName' book.json
"Paul"
"Carl"
Because authors is a JSON array, you need to use [] when accessing it. To access a
specific element of the array use the index, starting at position 0 ([0] to access the first
element of the array). To access all items in the array use [] with no index.
For more information on jq visit https://stedolan.github.io/jq.
Aggregating Data
Data is often collected from a variety of sources, and in a variety of files and formats. Before
you can analyze the data you must get it all into the same place and in a format that is conducive
to analysis.
Suppose you want to search a treasure trove of data files for any system named
ProductionWebServer. Recall that in previous scripts we wrapped our collected data in
XML tags with the following format: <systeminfo host="">. During collection we also
named our files using the host name. You can now use either of those attributes to find and
aggregate the data into a single location.
find /data -type f -exec grep '{}' -e 'ProductionWebServer' \;
-exec cat '{}' >> ProductionWebServerAgg.txt \;
The command find /data -type f lists all of the files in the /data directory and its
subdirectories. For each file found, it runs grep looking for the string
ProductionWebServer. If found, the file is appended (>>) to the file
ProductionWebServerAgg.txt. Replace the cat command with cp and a directory location if
you would rather copy all of the files to a single location rather than to a single file.
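For example, a version that copies the matching files into a single directory might look like this (the destination directory is illustrative and must already exist):
find /data -type f -exec grep '{}' -e 'ProductionWebServer' \; -exec cp '{}' /data/aggregate/ \;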
You can also use the join command to take data that is spread across two files and aggregate it
into one. Take the two files seen in Example 3-10 and Example 3-11.
Example 3-10. ips.txt
ip,OS
10.0.4.2,Windows 8
10.0.4.35,Ubuntu 16
10.0.4.107,macOS
10.0.4.145,macOS
Example 3-11. user.txt
user,ip
jdoe,10.0.4.2
jsmith,10.0.4.35
msmith,10.0.4.107
tjones,10.0.4.145
The files share a common column of data, which is the IP addresses. Because of that the files
can be merged using join.
$ join -t, -2 2 ips.txt user.txt
ip,OS,user
10.0.4.2,Windows 8,jdoe
10.0.4.35,Ubuntu 16,jsmith
10.0.4.107,macOS,msmith
10.0.4.145,macOS,tjones
The -t, option tells join that the columns are delimited using a comma; by default it uses a
space character.
The -2 2 option tells join to use the second column of data in the second file (user.txt) as the
key to perform the merge. By default join uses the first field as the key, which is appropriate
for the first file (ips.txt). If you needed to join using a different field in ips.txt you would just
add the option -1 n, where n is replaced by the appropriate column number.
WARNING
In order to use join both files must already be sorted by the column you will use
to perform the merge. To do this you can use the sort command which is covered
in Chapter 7.
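For example, to sort user.txt by its second column (the IP address) before joining, something like the following could be used; the -o option writes the sorted result back over the original file (this simple form does not treat the header line specially):
sort -t, -k2 user.txt -o user.txt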
Summary
In this chapter we explored ways to process common data formats including delimited,
positional, JSON, and XML. The vast majority of data you collect and process will be in one of
those formats.
In the next chapter we will look at how data can be analyzed and transformed into information
that will provide insights into system status and drive decision making.
Practice
After completing the exercises below you will be able to:
Extract columns of data from a file
Merge two files based on a common field
Replace characters in a file
Process JSON formatted data
Exercises
1. Given the file tasks.txt below, use the cut command to extract columns 1 (Image Name),
2 (PID), and 5 (Mem Usage).
Image Name;PID;Session Name;Session#;Mem Usage
System Idle Process;0;Services;0;4 K
System;4;Services;0;2,140 K
smss.exe;340;Services;0;1,060 K
csrss.exe;528;Services;0;4,756 K
2. Given the file procowner.txt below, use the join command to merge the file with
tasks.txt.
Process Owner;PID
jdoe;0
tjones;4
jsmith;340
msmith;528
3. Use the tr command to replace all of the semicolon characters in tasks.txt with the tab
character and print it to the screen.
4. Write a command that extracts the first and last names of all of the authors in book.json.
Chapter 4. Data Analysis
In the previous chapters we used scripts to collect data and prepare it for analysis. Now we need
to make sense of it all. When analyzing large amounts of data it often helps to start broad and
continually narrow the search as new insights are gained into the data.
We will use an Apache web server access log for most of the examples in this chapter. This
type of log records page requests made to the web server, when they were made, and who made
them. A sample of a typical log entry can be seen below. The full log file will be referenced as
access.log in this book and can be downloaded at https://www.rapidcyberops.com.
Example 4-1. Sample from access.log
192.168.0.11 - - [12/Nov/2017:15:54:39 -0500] "GET /request-quote.html HTTP/1.1" 200
7326 "http://192.168.0.35/support.html" "Mozilla/5.0 (Windows NT 6.3; Win64; x64;
rv:56.0) Gecko/20100101 Firefox/56.0"
NOTE
Web server logs are used simply as an example. The techniques introduced
throughout this chapter can be applied to analyze a variety of data types.
The Apache web server log fields are broken out in Table 4-1.
Table 4-1. Apache Web Server Combined Log Format Fields
Field                               Description                                             Field Number
192.168.0.11                        IP address of the host that requested the page          1
-                                   RFC 1413 Ident protocol identifier (- if not present)   2
-                                   The HTTP authenticated user ID (- if not present)       3
HTTP/1.1                            The HTTP protocol version                               8
200                                 The status code returned by the web server              9
7326                                The size of the file returned in bytes                  10
http://192.168.0.35/support.html   The referring page                                      11
Mozilla/5.0 (Windows NT 6.3; Win64… User agent identifying the browser                      12+
Note that there is a second type of Apache access log known as the Common Log Format. The
format is the same as the Combined Log Format except it does not contain fields for the
referring page or user agent. See https://httpd.apache.org/docs/2.4/logs.html for additional
information on the Apache log format and configuration.
The Hypertext Transfer Protocol (HTTP) status codes mentioned above are often very
informative and let you know how the web server responded to any given request. Common
codes are seen in Table 4-2:
Table 4-2. HTTP Status Codes
Code Description
200 OK
401 Unauthorized
404 Page Not Found
500 Internal Server Error
502 Bad Gateway
TIP
For a complete list of codes see the Hypertext Transfer Protocol (HTTP) Status
Code Registry at https://www.iana.org/assignments/http-status-codes
Commands in Use
We introduce sort, head, and uniq to limit the data we need to process and display. The
following file will be used for command examples:
Example 4-2. file1.txt
12/05/2017 192.168.10.14 test.html
12/30/2017 192.168.10.185 login.html
sort
The sort command is used to rearrange a text file into numerical and alphabetical order. By
default sort will arrange lines in ascending order starting with numbers and then letters.
Uppercase letters will be placed before their corresponding lowercase letter unless otherwise
specified.
-r
Sort in descending order
-f
Ignore case
-n
Use numerical ordering, so that 1, 2, 3 all sort before 10 (in the default alphabetical sorting, 2
and 3 would appear after 10)
-k
Sort based on a subset of the data (key) in a line. Fields are delimited by whitespace.
-o
Write output to a specified file.
COMMAND EXAMPLE
To sort file1.txt by the file name column and ignore the IP address column you would use the
following:
sort -k 2 file1.txt
You can also sort on a subset of the field. To sort by the 2nd octet in the IP address:
sort -k 1.5,1.7 file1.txt
This will sort using characters 5 through 7 of the first field.
uniq
The uniq command filters out duplicate lines of data that occur adjacent to one another. To
remove all duplicate lines in a file be sure to sort it before using uniq.
-c
Print out the number of times a line is repeated.
-f
Ignore the specified number of fields before comparing. For example, -f 3 will ignore the
first three fields in each line. Fields are delimited using spaces.
-i
Ignore letter case. By default uniq is case-sensitive.
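For example, combining sort with uniq -c counts how many times each source IP address appears in file1.txt:
$ cut -d' ' -f2 file1.txt | sort | uniq -c
      1 192.168.10.14
      1 192.168.10.185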
To do that you can use the sort, head, and tail commands at the end of a pipeline such as:
… | sort -k 2.1 -rn | head -15
which pipes the output of a script into the sort command and then pipes that sorted output into
head that will print the top 15 (in this case) lines. The sort command here is using as its sort
key (-k) the second field, beginning at its first character (2.1). Moreover, it is doing a reverse
sort (-r) and the values will be sorted like numbers (-n). Why a numerical sort? So that 2
shows up between 1 and 3 and not between 19 and 20 (which is alphabetical order).
By using head we take the first lines of the output. We could get the last few lines by piping
the output from the sort command into tail instead of head. Using tail -15 would give
us the last 15 lines. The other way to do this would be to simply remove the -r option on sort
so that it does an ascending rather than descending sort.
A high number of requests returning the 404 (Page Not Found) status code for a specific
page; this can indicate broken hyperlinks.
A high number of requests from a single IP address returning the 404 status code; this can
indicate probing activity looking for hidden or unlinked pages.
A high number of requests returning the 401 (Unauthorized) status code, particularly from
the same IP address; this can indicate an attempt at bypassing authentication, such as
brute-force password guessing.
To detect this type of activity we need to be able to extract key fields, such as the source IP
address, and count the number of times they appear in a file. To accomplish this we will use the
cut command to extract the field and then pipe the output into our new tool countem.sh.
Example 4-3. countem.sh
#!/bin/bash
# count the number of instances of an item
# using bash
declare -A cnt # assoc. array
while read id xtra
do
let cnt[$id]++
done
# now display what we counted
# for each key in the (key, value) assoc. array
for id in "${!cnt[@]}"
do
printf '%d %s\n' "${cnt[$id]}" "$id"
done
And here is another version, this time using awk:
Example 4-4. countem.awk
# count the number of instances of an item
# using awk and its associative arrays
awk '{ cnt[$1]++ }
END { for (id in cnt) {
printf "%d %s\n", cnt[id], id
}
}'
Since we don't know what IP addresses (or other strings) we might encounter, we will use
an associative array, declared here with the -A option, so that we can use whatever string
we read as our index.
The associative array feature of bash is found in bash 4.0 and higher. In such an array, the
index doesn't have to be a number but can be any string. So you can index the array by the
IP address and thus count the occurrences of that IP address. In case you're using something
older than bash 4.0, Example 4-4 is an alternate script that uses awk instead.
The array references are like others in bash, using the ${var[index]} syntax to
reference an element of the array. To get all the different index values that have been used
(the “keys” if you think of these arrays as (key, value) pairings), use: ${!cnt[@]}
While we only expect one word of input per line, we put the variable xtra there to capture
any other words that appear on the line. Each variable on a read command gets assigned
the corresponding word from the input (i.e., the first variable gets the first word, the second
variable gets the second word, and so on), but the last variable gets any and all remaining
words. On the other hand, if there are fewer words of input on a line than there are variables
on the read command, then those extra variables get set to the empty string. So for our
purposes, if there are extra words on the input line, they'll all be assigned to xtra, but if
there are no extra words then xtra will be given the value of the null string (which won't
matter either way because we don't use it).
Here we use that string as the index and increment its previous value. For the first use of the
index, the previous value will be unset, which will be taken as zero.
This syntax lets us iterate over all the various index values that we encountered. Note,
however, that the order is not guaranteed; it has to do with the hashing algorithm for the
index values, so it is not guaranteed to be in any particular order such as alphabetical order.
In printing out the value and key we put the values inside quotes so that we always get a
single value for each argument even if that value had a space or two inside it. It isn’t
expected to happen with our use of this script, but such coding practices make the scripts
more robust when used in other situations.
Both will work nicely in a pipeline of commands like this:
cut -d' ' -f1 logfile | bash countem.sh
or (see note 2 above) just:
bash countem.sh < logfile
For example, to count the number of times an IP address made a HTTP request that resulted in a
404 (page not found) error:
$ awk '$9 == 404 {print $1}' access.log | bash countem.sh
1 192.168.0.36
2 192.168.0.37
1 192.168.0.11
You can also use grep 404 access.log and pipe it into countem.sh, but that would
include lines where 404 appears in other places (e.g. the byte count, or part of a file path). The
use of awk here restricts the counting only to lines where the returned status (the ninth field) is
404. It then prints just the IP address (field 1) and pipes the output into countem.sh to get the
total number of times each IP address made a request that resulted in a 404 error.
To begin analysis of the example access.log file you can start by looking at the hosts that
accessed the web server. You can use the Linux cut command to extract the first field of the
log file, which contains the source IP address, and then pipe the output into the countem.sh
script. The exact command and output is seen below.
$ cut -d' ' -f1 access.log | bash countem.sh | sort -rn
111 192.168.0.37
55 192.168.0.36
51 192.168.0.11
42 192.168.0.14
28 192.168.0.26
TIP
If you do not have countem.sh available, you can use the uniq command's -c option
to achieve similar results, but it will require an extra pass through the data using
sort to work properly.
$ cut -d' ' -f1 access.log | sort | uniq -c | sort -rn
111 192.168.0.37
55 192.168.0.36
51 192.168.0.11
42 192.168.0.14
28 192.168.0.26
Next, you can further investigate by looking at the host that made the greatest number of requests,
which, as can be seen above, is IP address 192.168.0.37 with 111. You can use awk to filter
on the IP address, then pipe that into cut to extract the field that contains the request, and
finally pipe that output into countem.sh to provide the total number of requests for each page.
$ awk '$1 == "192.168.0.37" {print $0}' access.log | cut -d' ' -f7 | bash countem.sh
1 /uploads/2/9/1/4/29147191/31549414299.png?457
14 /files/theme/mobile49c2.js?1490908488
1 /cdn2.editmysite.com/images/editor/themebackground/stock/iPad.html
1 /uploads/2/9/1/4/29147191/2992005_orig.jpg
. . .
14 /files/theme/custom49c2.js?1490908488
The activity of this particular host is unimpressive, appearing to be standard web browsing
behavior. If you take a look at the host with the next highest number of requests, you will see
something a little more interesting.
$ awk '$1 == "192.168.0.36" {print $0}' access.log | cut d' ' f7 | bash counte
1 /files/theme/mobile49c2.js?1490908488
1 /uploads/2/9/1/4/29147191/31549414299.png?457
1 /_/cdn2.editmysite.com/.../Coffee.html
1 /_/cdn2.editmysite.com/.../iPad.html
. . .
1 /uploads/2/9/1/4/29147191/601239_orig.png
This output indicates that host 192.168.0.36 accessed nearly every page on the website
exactly one time. This type of activity often indicates web-crawler or site-cloning activity. If you
take a look at the user agent string provided by the client it further verifies this conclusion.
$ awk '$1 == "192.168.0.36" {print $0}' access.log | cut -d' ' -f12-17 | uniq
"Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)
The user agent identifies itself as HTTrack, which is a tool used to download or clone
websites. While not necessarily malicious, it is interesting to note during analysis.
TIP
You can find additional information on HTTrack at http://www.httrack.com.
The solution is not much different from countem.sh; you just need a few small changes.
First, you need more columns of data, by tweaking the input filter (the cut command) to extract
two columns (IP address and byte count) rather than just the IP address. Second, you will change
the calculation from an increment (let cnt[$id]++), a simple count, to a summing of that
second field of data (let cnt[$id]+=$data).
The pipeline to invoke this will now extract two fields from the logfile, the first and the last.
cut -d' ' -f 1,10 access.log | bash summer.sh
Example 4-5. summer.sh
# sum the total of field 2 values for each unique field 1
# using bash
declare -A cnt # assoc. array
{ while read id count
do
let cnt[$id]+=$count
done
for id in "${!cnt[@]}"
do
printf "%15s %8d\n" "${id}" "${cnt[${id}]}"
done ; }
Note that we've made a few other changes to the output format. We've added field sizes of
15 characters for the first string (the IP address in our sample data), left-justified (via the
minus sign), and 8 digits for the sum values. If the sum is larger, it will print the larger
number, and if the string is longer, it will be printed in full. We've done this to get the data
to align, by and large, nicely in columns, for readability.
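To see how that format string behaves, here is the same printf call with sample values taken from the output below:
$ printf '%-15s %8d\n' "192.168.0.37" 2575030
192.168.0.37     2575030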
You can run summer.sh against the example access.log file to get an idea of the total amount of
data requested by each host. To do this use cut to extract the IP address and bytes transferred
fields, and then pipe the output into summer.sh.
$ cut -d' ' -f1,10 access.log | bash summer.sh | sort -k 2.1 -rn
192.168.0.36 4371198
192.168.0.37 2575030
192.168.0.11 2537662
192.168.0.14 2876088
192.168.0.26 665693
These results can be useful in identifying hosts that have transferred unusually large amounts of
data compared to other hosts. A spike could indicate data theft and exfiltration. If you identify
such a host the next step would be to review the specific pages and files accessed by the
suspicious host to try and classify it as malicious or benign.
The script to do the printing will take the first field as the index into an associative array and the
second field as the value for that array element. It will then iterate through the array and print a
number of # characters to represent the count, scaled to 50 # symbols for the largest count in the
list.
Example 4-6. histogram.sh
#!/bin/bash
#
# histogram.sh horizontal bar charts
#
# input: label value
# label value
# label value
#
# output:
# label ######
# label ###
# label ########
#
function pr_bar ()
{
local -i i raw maxraw scaled
raw=$1
maxraw=$2
((scaled=(MAXBAR*raw)/maxraw))
# min size guarantee
((raw > 0 && scaled == 0)) && scaled=1
for((i=0; i<scaled; i++)) ; do printf '#' ; done
printf '\n'
} # pr_bar
#
# "main"
#
declare -A RA
declare -i MAXBAR max
max=0
MAXBAR=50 # how large the largest bar should be
while read labl val
do
let RA[$labl]=$val
# keep the largest value; for scaling
(( val > max )) && max=$val
done
# scale and print it
for labl in "${!RA[@]}"
do
printf '%20.20s ' "$labl"
pr_bar ${RA[$labl]} $max
done
We define a function to draw a single bar of the histogram. This definition must be
encountered before a call to the function can be made, so it makes sense to put function
definitions at the front of our script. We will be reusing this function in a future script so we
could have put it in a separate file and included it here with a source command but we
didn’t.
We declare all these variables as local because we don’t want them to interfere with variable
names in the rest of this script (or any others, if we copy/paste this script to use elsewhere).
We declare all these variables as integer (that's the -i option) because we are only going to
compute values with them and not use them as strings.
The computation is done inside double parentheses, and inside those we don't need to use
the $ to indicate "the value of" each variable name.
This is an "if-less" if statement. If the expression inside the double parentheses is true,
then, and only then, is the second expression (the assignment) executed. This will guarantee
that scaled is never zero when the raw value is nonzero. Why? Because we'd like
something to show up in that case.
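The same pattern works anywhere in bash; for example (the variable name is just illustrative):
(( failures > 3 )) && echo 'too many failures'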
The main part of the script begins with a declaration of the RA array as an associative array.
Here we reference the associative array using the label, a string, as its index.
Since the array isn't indexed by numbers, we can't just count integers and use them as indices.
This construct gives all the various strings that were used as an index to the array, one at a
time, in the for loop.
We use the label as an index one more time to get the count and pass it as the first parameter
to our pr_bar function.
Note that the items don’t appear in the same order as the input. That’s because the hashing
algorithm for the key (the index) doesn’t preserve ordering. You could take this output and pipe
it into yet another sort, or you could take a slightly different approach.
Here’s a version of the histogram script that preserves order by not using an associative array.
This might also be useful on older versions of bash (pre 4.0), prior to the introduction of
associative arrays. Only the “main” part of the script is shown as the function pr_bar remains
the same.
Example 4-7. histogram_plain.sh
#
# "main" this version uses plain arrays
#
declare -a RA_key RA_value
declare -i max ndx
max=0
MAXBAR=50 # how large the largest bar should be
ndx=0
while read labl val
do
RA_key[$ndx]=$labl
RA_value[$ndx]=$val
# keep the largest value; for scaling
(( val > max )) && max=$val
let ndx++
done
# scale and print it
for ((j=0; j<ndx; j++))
do
printf "%20.20s " ${RA_key[$j]}
pr_bar ${RA_value[$j]} $max
done
This version of the script avoids the use of associative arrays in case you are running an older
version of bash (prior to 4.x), such as on MacOS systems. For this version we use two separate
arrays, one for the index value and one for the counts. Since they are normal arrays we have to
use an integer index and so we will keep a simple count in the variable ndx.
Here the variable names are declared as arrays. The lowercase -a says that they are arrays,
but not of the "associative" variety. While not strictly necessary, it is good practice.
The key and value pairs are stored in separate arrays, but at the same index location. This
approach is "brittle," that is, easily broken, if changes to the script ever got the two arrays
out of sync.
Now the for loop, unlike the previous script, is a simple counting of an integer from 0 to
ndx. The variable j is used here so as not to interfere with the index in the for loop
inside pr_bar, although we were careful enough inside the function to declare its version of
i as local to the function. Do you trust it? Change the j to an i here and see if it still works
(it does). Then try removing the local declaration and see if it fails (it does).
This approach with the two arrays does have one advantage. By using the numerical index for
storing the label and the data, you can retrieve them in the order they were read, that is, in the
numerical order of the index.
You can now visually see the hosts that transferred the largest number of bytes by extracting the
appropriate fields from access.log, piping the results into summer.sh and then into histogram.sh.
$ cut -d' ' -f1,10 access.log | bash summer.sh | bash histogram.sh
192.168.0.36 ##################################################
192.168.0.37 #############################
192.168.0.11 #############################
192.168.0.14 ################################
192.168.0.26 #######
While this might not seem that useful for the small amount of sample data, being able to
visualize trends is invaluable when looking across larger datasets.
In addition to looking at the number of bytes transferred by IP address or host, it is often
interesting to look at the data by date and time. To do that you can use the summer.sh script, but
due to the format of the access.log file you need to do a little more processing before you can
pipe it into the script. If you use cut to extract the date/time and bytes transferred fields you are
left with data that causes some problems for the script.
$ cut -d' ' -f4,10 access.log
[12/Nov/2017:15:52:59 2377
[12/Nov/2017:15:52:59 4529
[12/Nov/2017:15:52:59 1112
As seen in the output above, the raw data starts with a [ character. That causes a problem with
the script because it denotes the beginning of an array in bash. To remedy that you can use an
additional iteration of the cut command to remove the character using -c2- as an option. This
option tells cut to extract the data by character, starting at position 2 and going to the end of the
line (-). The corrected output with the square bracket removed can be seen below.
$ cut -d' ' -f4,10 access.log | cut -c2-
12/Nov/2017:15:52:59 2377
12/Nov/2017:15:52:59 4529
12/Nov/2017:15:52:59 1112
TIP
Alternatively, you can use tr in place of the second cut. The -d option will delete
the character specified, in this case the square bracket.
cut -d' ' -f4,10 access.log | tr -d '['
You also need to determine how you want to group the time-based data: by day, month, year,
hour, etc. You can do this by simply modifying the option for the second cut iteration. The table
below illustrates the cut option to use to extract various forms of the date/time field. Note that
these cut options are specific to Apache log files.
Table 4-3. Apache Log Date/Time Field Extraction
Grouping   cut Option
Day        -c2-3,22-
Month      -c5-7,22-
Year       -c9-12,22-
Hour       -c14-15,22-
The histogram.sh script can be particularly useful when looking at time-based data. For
example, if your organization has an internal web server that is only accessed during working
hours of 9:00 AM to 5:00 PM, you can review the server log file on a daily basis using the
histogram view and see if there are any spikes in activity outside of normal working hours.
Large spikes of activity or data transfer outside of normal working hours could indicate
exfiltration by a malicious actor. If any anomalies are detected you can filter the data by that
particular date and time and review the page accesses to determine if the activity is malicious.
For example, if you want to see a histogram of the total amount of data that was retrieved on a
certain day and on an hourly basis you can do the following:
$ awk '$4 ~ "12/Nov/2017" {print $0}' access.log | cut -d' ' -f4,10 |
cut -c14-15,22- | bash summer.sh | bash histogram.sh
17 ##
16 ###########
15 ############
19 ##
18 ##################################################
Here the access.log file is sent through awk to extract the entries from a particular date. Note
the use of the like operator (~) instead of == since field 4 also contains time information. Those
entries are piped into cut to extract the date/time and bytes transferred fields, and then piped
into cut again to extract just the hour. From there it is summed by hour using summer.sh and
converted into a histogram using histogram.sh. The result is a histogram that displays the total
number of bytes transferred each hour on November 12, 2017.
$ awk '$1 == "192.168.0.37" {print $0}' access.log | cut -d' ' -f7 |
bash countem.sh | sort -rn | head -5
14 /files/theme/plugin49c2.js?1490908488
14 /files/theme/mobile49c2.js?1490908488
14 /files/theme/custom49c2.js?1490908488
14 /files/main_styleaf0e.css?1509483497
3 /consulting.html
While this can be accomplished by piping together commands and scripts, that requires multiple
passes through the data. This may work for many datasets, but it is too inefficient for extremely
large datasets. You can streamline this by writing a bash script specifically designed to extract
and count page accesses, requiring only a single pass over the data.
Example 4-8. pagereq.sh
# count the number of page requests from an address ($1)
declare -A cnt
while read addr d1 d2 datim gmtoff getr page therest
do
if [[ $1 == $addr ]] ; then let cnt[$page]+=1 ; fi
done
for id in ${!cnt[@]}
do
printf "%8d %s\n" ${cnt[$id]} $id
done
We declare cnt as an associative array (also known as a hash table or dictionary) so that
we can use a string as the index to the array. In this program we will be using the page
address (the URL) as the index.
The ${!cnt[@]} results in a list of all the different index values that have been
encountered. Note, however, that they are not listed in any useful order.
Early versions of bash don't have associative arrays. You can instead use awk to do the same thing,
counting the various page requests from a particular IP address, since awk has associative
arrays.
Example 4-9. pagereq.awk
# count the number of page requests from an address ($1)
awk v page="$1" '{ if ($1==page) {cnt[$7]+=1 } }
END { for (id in cnt) {
printf "%8d %s\n", cnt[id], id
}
}'
There are two very different $1 variables on this line. The first $1 is a shell variable and
refers to the first argument supplied to this script when it is invoked. The second $1 is an
awk variable. It refers to the first field of the input on each line. The first $1 has been
assigned to the awk variable page so that it can be compared to each $1 of awk, that is, to
each first field of the input data.
This simple syntax results in the variable id iterating over the index values of
the cnt array. It is much simpler syntax than the shell's "${!cnt[@]}" syntax, but with
the same effect.
You can run pagereq.sh by providing the IP address you would like to search for and redirect
access.log as input.
$ bash pagereq.sh 192.168.0.37 < access.log | sort -rn | head -5
14 /files/theme/plugin49c2.js?1490908488
14 /files/theme/mobile49c2.js?1490908488
14 /files/theme/custom49c2.js?1490908488
14 /files/main_styleaf0e.css?1509483497
3 /consulting.html
Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:59.0) Gecko/20100101 Firefox/59.0
This user agent string identifies the system as: Windows NT version 6.3 (aka Windows 8.1); 64
bit architecture; and using the Firefox browser.
The user agent string is interesting for a few reasons: first, because of the significant amount of
information it conveys, which can be used to identify the types of systems and browsers
accessing the server; second, because it is configurable by the end user, which can be used to
identify systems that may not be using a standard browser or may not be using a browser at all
(i.e., a web crawler).
You can identify unusual user agents by first compiling a list of known good user agents. For
the purposes of this exercise we will use a very small list that is not specific to a particular
version.
Example 4-10. useragents.txt
Firefox
Chrome
Safari
Edge
TIP
For a list of common user agent strings visit
https://techblog.willshouse.com/2012/01/03/most-common-user-agents/
You can then read in a web server log and compare each line to each valid user agent until you
get a match. If no match is found it should be considered an anomaly and printed to standard out
along with the IP address of the system making the request. This provides yet another vantage
point into the data, identifying systems with unusual user agents, and another path to further
explore.
Example 4-11. useragents.sh
#!/bin/bash
# useragents.sh read through a log looking for unknown user agents
#
# mismatch search through the array of known names
# returns 1 (false) if it finds a match
# returns 0 (true) if there is no match
function mismatch ()
{
local -i i
for ((i=0; i<$KNSIZE; i++))
do
[[ "$1" =~ .*${KNOWN[$i]}.* ]] && return 1
done
return 0
}
# read up the known ones
readarray -t KNOWN < "known.names"
KNSIZE=${#KNOWN[@]}
# preprocess logfile (stdin) to pick out ipaddr and user agent
awk F'"' '{print $1, $6}' | \
while read ipaddr dash1 dash2 dtstamp delta useragent
do
if mismatch "$useragent"
then
echo "anomaly: $ipaddr $useragent"
fi
done
We will use a function for the core of this script. It will return a success (or “true”) if it finds
a mismatch, that is, if it finds no match against the list of known user agents. This logic may
seem a bit inverted, but it makes the if statement containing the call to mismatch read
clearly.
Declaring our for loop index as a local variable is good practice. It’s not strictly necessary
in this script but is a good habit.
There are two strings to compare: the input from the logfile and a line from the list of
known user agents. To make for a very flexible comparison we use the regex comparison
operator (=~). The .* (meaning "zero or more instances of any character") placed on
either side of the KNOWN array reference means that the known string can appear anywhere
within the other string for a match.
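For example, this comparison succeeds because the known string Firefox appears somewhere inside the longer user agent string:
$ KNOWN[0]='Firefox'
$ UA='Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:59.0) Gecko/20100101 Firefox/59.0'
$ [[ "$UA" =~ .*${KNOWN[0]}.* ]] && echo match
match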
Each line of the file is added as an element to the array name specified. This gives us an
array of known user agents. There are two identical ways to do this in bash: either
readarray, as used here, or mapfile. The -t option removes the trailing newline from
each line read. The file containing the list of known user agents is specified here; modify as
needed.
This computes the size of the array. It is used inside the mismatch function to loop
through the array. We calculate it here, once, outside our loop to avoid recomputing it every
time the function is called.
The input string is a complex mix of words and quote marks. To capture the user agent
string we use the double quote as the field separator. Doing that, however, means that our
first field contains more than just the IP address. By using the bash read we can parse on
the spaces to get the IP address. The last argument of the read takes all the remaining
words, and so it can capture all the several words of the user agent string.
Summary
In this chapter we looked at techniques to analyze the content of log files by identifying unusual
and anomalous activity. This type of analysis can provide you with insights into what occurred
in the past. In the next chapter we will look at how to analyze log files and other data to provide
insights into what is happening in the system in real time.
Practice
After completing the exercises below you will be able to:
Use command line scripts to extract data from log files
Write and modify bash scripts
Exercises
1. Expand the histogram.sh script to include the count at the end of each histogram bar. Here
is sample output:
192.168.0.37 ############################# 2575030
192.168.0.26 ####### 665693
2. Expand the histogram.sh script to allow the user to supply the option -s that specifies the
maximum bar size. For example, histogram.sh -s 25 would limit the maximum bar
size to 25 # characters. The default should remain at 50 if no option is given.
3. Download the following web log file TODO: Add Log File URL.
a. Which IP address made the most number of requests?
b. Which page was accessed the most number of times?
4. Download the following Domain Name System (DNS) server log TODO: Add Log File
URL
a. What was the most requested domain?
b. What day had the most number of requests?
5. Modify the useragents.sh script to add some parameters
a. Add code for an optional first parameter to be a filename of the known hosts. If not
specified, default to the name known.hosts as it currently is used.
b. Add code for a -f option to take an argument. The argument is the filename of the
logfile to read rather than reading from stdin.
6. Modify the pagereq.sh script to not need an associative array but to work with a traditional
array that uses a numerical index. Convert the IP address into a 10-12 digit number for that
use. Caution: don’t have leading zeros on the number or the shell will attempt to interpret
it as an octal number. Example: convert “10.124.16.3” into “10124016003” which can be
used as a numerical index.