DIFO2023 Lab1
Johannes Olegård
version: 2023-10-20 20:00
Introduction
This lab aims to introduce you to some forensic techniques and try to apply them. We mainly
focus on Windows artifacts, but if you get the idea, you should be able to extrapolate to other
systems.
You will need a Linux-like system to run various programs to do this assignment. Kali Linux
is recommended and is already installed directly onto computers in the lab room, but you can use
your own computer if you like.
If you run macOS, it might work without a Linux VM, but you may need to substitute some
programs/commands for Mac equivalents (so a Linux VM is probably easier). Later in the
assignment, you will use a VirtualBox-based Windows 10 VM (that we provide) to do experiments,
so you will still need VirtualBox. Note that in VirtualBox you add VBOX-files, but you use
import appliance on OVA-files.
There are no harmful files in this assignment (such as malware), so it is safe to analyze it on
your own computer.
Passwords / Credentials
extftp.cs2lab.dsv.su.se cs2lab:dsvcs2
labcomputers cs2lab:dsvcs2
kali VM kali:kali
windows VM cs2lab:dsvcs2
Handin
Hand in a single PDF in iLearn consisting of:
1. A frontpage that clearly states the name and email address of each group member, and
the name of the assignment the handin is for (i.e. include the text “DIFO 2023 lab1”).
2. A list of answers to all the “Qn:” questions (there is a total of 59 questions). There is no
need to explain how you got the answers unless explicitly asked.
If you have issues with a group member not participating, please state so on the front page. We
will contact all group members by email to hear both sides of the conflict. Usually, the result is
that we kick the non-contributing member out of the group (so they will have to join a new group
and submit lab1 again), and only the participating members (the ones listed on the assignment)
get a grade for the assignment. See also the DSV code of honor1 .
1 https://www.su.se/department-of-computer-and-systems-sciences/education/during-your-studies/code-of-honour-at-dsv-1.548067
Some tips, tricks and quality of life suggestions
Keyboard layout. The Kali VM might not realize that you use a Swedish keyboard. You can
tell it by clicking the button in the top-left corner of the screen (the Kali logo), typing
keyboard and clicking on the Keyboard app. Next, go to the layout tab and change the
language.
RTFM. Whenever you are asked to run a command with certain options, look up what the
command (e.g. curl ) does and what each option does (e.g. -L ). This will leave you a lot less
confused and a command used in one assignment might also be useful in another assignment (e.g.
sort , uniq -c and less ). ChatGPT can help, but it is also good to know how to quickly
search a man page (since ChatGPT does not know everything).
Parallelization. If you want your commands to run faster, and your Linux machine has multiple
CPU cores, then you can try parallel (GNU parallel). For example: ls -1 *.bin | parallel ./a.sh
where the file a.sh contains:
#!/bin/bash
if (( $# != 1 )) ; then
    echo "ERROR: wrong num args" 1>&2;
    exit 1
fi
Contents
1 Part 1 - Dumpster Diving (trash.zip)
1.1 File type classification using a tool
1.2 File type classification using a hex editor
1.3 Text search
1.4 Hashing
1.5 Visualize the difference between two files
1.6 Filtering with NSRL
1.7 Fuzzy file comparison using ssdeep
1.8 Photo Metadata extraction using Exiftool
1.9 Entropy
1.10 Finding encrypted files
1.11 Password cracking
1.12 Steganography
1 Part 1—Dumpster Diving (trash.zip)
For Part 1 all tasks are about dataset trash.zip , or “the dataset” for short. The dataset consists
of almost 4000 files named 1.bin, 2.bin, and so on. Your job is to analyze these files from various
perspectives, using various command line tools. It is useful for forensic analysis to think of a hard
drive as a dataset of files (or file-like things) in this way.
$ curl -O https://extftp.cs2lab.dsv.su.se/DIFO/2023/lab1/trash.zip
$ unzip -d trash/ trash.zip
$ ls trash
1.1 File type classification using a tool
$ file *.bin
Combine file with tools like sed , grep , sort and uniq to answer these questions:
Q1: How many photos/images are there in the dataset?
Q2: What does the file -command say when it does not recognize a binary file format?
Q3: How many files did the file -command not recognize?
Q4: Write a command line that answers the above question (i.e. a line of text that can be
pasted into the kali terminal).
Q5: In at most 200 characters, explain briefly how that command line works.
1.2 File type classification using a hex editor
$ bless
$ xxd 1.bin
Q6: Compare the first few bytes of the photo/image files you found. What are the three most
common “magic numbers” you see?
You might want to check the options for xxd to make this question easier.
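As an alternative to reading xxd output by eye, the leading bytes of each file can be tallied with a short script. The following is only a sketch: the function names leading_bytes and tally_magic , the 4-byte prefix length and the trash directory name are assumptions made for illustration, not part of the lab material.

```python
from collections import Counter
from pathlib import Path


def leading_bytes(data: bytes, n: int = 4) -> str:
    """Return the first n bytes of data as an uppercase hex string."""
    return data[:n].hex().upper()


def tally_magic(paths, n: int = 4) -> Counter:
    """Count how often each n-byte prefix occurs among the given files."""
    counts = Counter()
    for path in paths:
        with open(path, "rb") as handle:
            counts[leading_bytes(handle.read(n), n)] += 1
    return counts


if __name__ == "__main__":
    # Assumes the dataset was unpacked into ./trash (hypothetical location).
    files = sorted(Path("trash").glob("*.bin"))
    for prefix, count in tally_magic(files).most_common(5):
        print(prefix, count)
```

The prefixes printed this way still have to be matched against a signature table by hand; the script only does the counting.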
1.3 Text search
$ strings 1.bin
$ grep -a smurf *.bin
$ strings *.bin | grep -o 'smurf'
Q7: Which files contain the text “heffaklump” in either ASCII or UTF-8?
Q8: Which files contain the text “heffaklump” in little endian UTF-16 encoding?
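The difference between the encodings in the two questions above comes down to byte layout: for plain ASCII letters, UTF-8 produces the same bytes as ASCII, while little endian UTF-16 stores each such letter followed by a zero byte. The sketch below shows this for the word smurf used in the commands above; the helper names encodings_of and find_encodings are made up for this example.

```python
def encodings_of(text: str) -> dict:
    """Return the byte sequences of text in UTF-8 and little-endian UTF-16."""
    return {
        "utf-8": text.encode("utf-8"),
        "utf-16-le": text.encode("utf-16-le"),
    }


def find_encodings(data: bytes, text: str) -> list:
    """List the encodings in which text occurs somewhere inside data."""
    return [name for name, needle in encodings_of(text).items() if needle in data]


if __name__ == "__main__":
    # Print the raw bytes of the same word in both encodings.
    for name, raw in encodings_of("smurf").items():
        print(f"{name:10} {raw.hex(' ')}")
```

This is why a plain-text search for the ASCII form does not match the UTF-16 form: the search pattern has to account for the interleaved zero bytes.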
Q9: Write a grep pattern to find IPv4 addresses (e.g. 193.123.156.189 ). (It does not have
to be perfect, some false positives are okay.)
2 https://www.garykessler.net/library/file_sigs.html
Q10: How many unique IPv4-addresses can you find in the dataset? Use the pattern you
described above.
1.4 Hashing
Hashing is used in at least three ways in digital forensics: preservation, file comparison and fuzzy
file comparison (including malware analysis). For preservation we hash a file before processing it
so that we later can check that we did not accidentally modify the file. Hashes can be used to
check if two files are identical or almost identical.
On Linux, here are some tools to compute hashes of (whole) files:
$ sha1sum 1.bin
$ sha256sum 1.bin
$ md5sum 1.bin
$ hashdeep 1.bin
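The preservation use of hashing described above can also be sketched in Python: record a hash before processing, then recompute and compare afterwards. This is a minimal illustration, assuming SHA1 as the default algorithm; the function names hash_file and is_unmodified are hypothetical and not taken from any of the tools listed.

```python
import hashlib


def hash_file(path, algorithm: str = "sha1", chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so that large files need not fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unmodified(path, recorded_hash: str, algorithm: str = "sha1") -> bool:
    """Compare the current hash with one recorded earlier, ignoring letter case."""
    return hash_file(path, algorithm) == recorded_hash.strip().lower()
```

Typical use would be to store the result of hash_file(path) when the evidence is collected and to call is_unmodified(path, stored_value) before and after each analysis step.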
Q11: Which groups of files in the dataset share the same SHA1 hash? (Tip: sort | uniq -c
and grep )
For example, files 2535.bin and 2536.bin both have the same hash and no other files in
the dataset have that hash, so these two files make up one of the groups.
Q12: What command line did you use to answer the above question?
Q13: Explain in at most 200 characters how that command line works.
Q14: There is something odd about the hashes of four files in the dataset. Which ones? (hint:
compare sha1 and md5 using the command line you developed in the above questions).3
File 3614.bin contains the output of the sha1sum command. Check if those hashes are still
correct. There is a neat option for sha1sum that does this:
$ sha1sum -c 3614.bin
1.5 Visualize the difference between two files
Q16: Compare files 1315.bin and 2091.bin (using e.g. vimdiff ). How many bytes differ?
1.6 Filtering with NSRL
Software Reference Library (NSRL) Reference Data Set (RDS)” 4 . There should already be a copy
of it on the lab computers ( /home/kali/difo/nsrl ). Using this hash set we can filter out files
that are not interesting. As you can imagine, we can in a similar way build a hash set of illicit images
to flag files that are instead very interesting forensically (which we will not demonstrate here).
Here is some code to download and query NSRL RDS:
Q17: Write a python script to determine which files are in the database. Below is a stub you
can start from ( 894.bin ). Beware of upper and lower case hashes when comparing! It is probably
fastest to use SHA256.
#!/usr/bin/env python3
import hashlib
import sqlite3
import pathlib
import argparse


def sha256_hash_file(path):
    hash = 123  # TODO write code to calculate the hash of the file named by path
    return hash


def main():
    # $ python3 script.py database_file files_to_hash...
    parser = argparse.ArgumentParser()
    parser.add_argument('database_file')
    parser.add_argument('files_to_hash', nargs='+')
    args = parser.parse_args()
    #print(args.database_file, args.files_to_hash)


if __name__ == '__main__':
    main()
$ python3 ./myscript.py ~/difo/nsrl/RDS_2023.03.1_modern_minimal/RDS_2023.03.1_modern_minimal.db ~/difo/trash/*.bin
900.bin 13BC70E4D044FD383194F1FA9C7C102D0F8D2B81302E80B0E54693470AD4B6A7
... (and so on, 2756.bin should not be here!)
nsrl-download/current-rds
1.7 Fuzzy file comparison using ssdeep
Sometimes, we want to find similar (but not exactly identical) files, such as two versions of the same
MS Word document, or a (slightly) photo-shopped photo and its original. Tools like ssdeep -d
can do this. They work by (intelligently) chopping each file into smaller chunks, hashing the chunks
individually and comparing files based on common chunks.
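The chunk-and-compare idea can be illustrated with a deliberately simplified sketch. Note the assumptions: this version cuts fixed-size chunks and scores two inputs by the Jaccard similarity of their chunk-hash sets, whereas ssdeep chooses chunk boundaries from the content itself and uses its own scoring, so the numbers produced here are not comparable to ssdeep output. The names chunk_hashes and similarity are hypothetical.

```python
import hashlib


def chunk_hashes(data: bytes, chunk_size: int = 64) -> set:
    """Hash each fixed-size chunk of data and return the set of chunk hashes."""
    return {
        hashlib.sha1(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    }


def similarity(a: bytes, b: bytes, chunk_size: int = 64) -> float:
    """Jaccard similarity of the two chunk-hash sets (1.0 means same chunks)."""
    hashes_a = chunk_hashes(a, chunk_size)
    hashes_b = chunk_hashes(b, chunk_size)
    if not hashes_a and not hashes_b:
        return 1.0  # two empty inputs are treated as identical
    return len(hashes_a & hashes_b) / len(hashes_a | hashes_b)
```

With fixed-size chunks, a single inserted byte shifts every later chunk and destroys the similarity, which is one reason real tools pick boundaries based on content instead.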
Q19: What are the 10 most similar pairs of files in the dataset, excluding pairs that are exactly
identical?
1.8 Photo Metadata extraction using Exiftool
$ exiftool 1.bin
Q20: Which photo files in the dataset were taken by a Samsung S9 smartphone camera
(according to the EXIF metadata)?
Q21: Which files in the dataset contain GPS coordinates in their EXIF data?
Q22: For each of those files, which Swedish city/town is closest to where the photo was taken?
See: 5 6
1.9 Entropy
One way to detect encrypted or compressed files is by testing for high information entropy7 . There
are a few tools that can do this:
$ binwalk -N -E *.bin
$ ent 1.bin
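To make the entropy measure concrete, the sketch below computes the Shannon entropy of a byte string in bits per byte. It is an illustration written for this text, not the implementation used by binwalk or ent ; the function name shannon_entropy is an assumption.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of data in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(data).values()
    )
```

A file consisting of one repeated byte scores 0, while data in which all 256 byte values are equally common scores 8. Encrypted and well-compressed data tends to sit close to 8, which is what makes the measure useful as a first filter.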
1.11 Password cracking
Tools like john and hashcat take as input a list of hashes (and the hash algorithm that
generated them). So to crack a password-protected file, we first must convert it into a list of
hashes.
Q24: File 1924.bin is a unix-style “passwd” file and 651.bin is a unix-style “shadow” file
(both from the same computer). What is the password of user drsnuggles ?
Q25: Files 2834.bin and 1596.bin consist of the “SYSTEM” and “SAM” registry files of a
Windows 10 machine. What is the password of user inspectorgadget?
In digital forensics we might know something about the person whose password we are trying
to crack. People tend to reuse passwords (we’re all guilty—please start using a password manager
like BitWarden8 !). Similarly, people tend to use words and numbers from things they like (like
the names and birthyears of their kids, or names of fictional characters, sports teams, concepts
or celebrities). Or they just wrote the password down in a text-file on their computer. So if we
have a list of known passwords (from the same person or other people) or a list of strings from
that person that could be passwords, then we can use that as the “wordlist” of passwords to crack.
When we do not provide a wordlist, john will use a default list (and resort to brute forcing when
it runs out).
Q26: There is a password-protected PDF in the dataset. What is the title of that PDF (i.e.
the title text written inside that PDF)? Use file 3616.bin as a wordlist (it contains a list of all
4-character long lower-case only strings).
$ pdf2john something.pdf > pdf.johnformat # extract hash and save in esoteric file format
$ john --wordlist=3616.bin pdf.johnformat
$ john --show pdf.johnformat # show the password
$ qpdf -password=PUTTHEPASSWORDHERE -decrypt something.pdf something.versionwithoutpassword.pdf
$ evince something.pdf # the evince GUI will prompt for the password
Q27: There is a password-protected ZIP-file in the dataset. What is the total of John Doe?
The password is in a file somewhere in the dataset. Use strings (and sort -u ) on the
dataset and use the output as a wordlist. Hint: the password is between 7 and 30 characters
long, contains only: digits, lowercase ASCII letters, uppercase ASCII letters and punctuation from
%&/()!"#$ . It has at least one character from each of these four classes.
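The hint above amounts to a filter that can shrink a large wordlist before it is handed to a cracking tool. The following sketch shows one way to express it; it assumes the length bounds are inclusive and that no characters outside the four listed classes occur. The names matches_hint and filter_wordlist are hypothetical.

```python
import string

ALLOWED_PUNCTUATION = '%&/()!"#$'
CHARACTER_CLASSES = (
    string.digits,
    string.ascii_lowercase,
    string.ascii_uppercase,
    ALLOWED_PUNCTUATION,
)


def matches_hint(candidate: str) -> bool:
    """Check a candidate against the length and character-class hint."""
    if not 7 <= len(candidate) <= 30:  # assumption: bounds are inclusive
        return False
    if not set(candidate) <= set("".join(CHARACTER_CLASSES)):
        return False
    # Require at least one character from each of the four classes.
    return all(any(ch in cls for ch in candidate) for cls in CHARACTER_CLASSES)


def filter_wordlist(lines):
    """Yield the unique candidates, in sorted order, that satisfy the hint."""
    for candidate in sorted({line.rstrip("\n") for line in lines}):
        if matches_hint(candidate):
            yield candidate
```

The lines could come, for example, from the saved output of strings , read with open(...) and passed directly to filter_wordlist.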
Note that tools like john also support generating more passwords from a wordlist (e.g. using
password to generate “ P@$sw0rD123 ”), which we did not need here.
1.12 Steganography
Steganography is about hiding data inside other data, typically in plain sight, for example by
manipulating the pixels in a photo in a way that is difficult for the human eye to perceive. Typically
steganography is difficult to detect (especially if the message is encrypted before being hidden)
and we often have to test individual techniques/tools one by one. steghide is one tool for hiding
data inside JPEG-files (by manipulating pixel/color data) and there are many others.
8 https://bitwarden.com/
Q28: I have used steghide to hide a string of text in one of the JPEG files in the dataset.
Which file?
Q29: What is the “fourth step” mentioned in the embedded file?
2 Part 2—Working with disks (brain.raw, brainram.dmp)
In this part of the lab we will examine the raw disk image brain.raw , which I will refer to as “the
disk” below. Later, we will also look at a RAM dump from the same machine, brainram.dmp .
2.1 Preservation
I have already done the task of “collecting” the disk for you. As part of that process I computed
hashes of that disk. As a forensic analyst, your first task would be to compute and verify that this
hash is still correct.
Q30: What is the sha1 hash of the disk?
Q31: What is the md5 hash of the disk?
$ mmls -B brain.raw
Note that a “sector” is 512 bytes and that mmls by default gives you offsets counted in sectors.
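Converting between sector counts and byte counts is a single multiplication or division by the sector size. The sketch below spells this out; the start sector and length used in the example are made-up values and do not describe brain.raw .

```python
SECTOR_SIZE = 512  # bytes per sector, as stated above


def sectors_to_bytes(sectors: int, sector_size: int = SECTOR_SIZE) -> int:
    """Convert a sector count or sector offset into bytes."""
    return sectors * sector_size


def bytes_to_sectors(num_bytes: int, sector_size: int = SECTOR_SIZE) -> int:
    """Convert a byte offset into sectors, refusing unaligned offsets."""
    sectors, remainder = divmod(num_bytes, sector_size)
    if remainder:
        raise ValueError(f"{num_bytes} is not a multiple of {sector_size}")
    return sectors


if __name__ == "__main__":
    start_sector = 2048      # hypothetical partition start
    length_sectors = 204800  # hypothetical partition length
    print("byte offset:", sectors_to_bytes(start_sector))
    print("byte length:", sectors_to_bytes(length_sectors))
```

Many tools expect one unit or the other, so it is worth checking in the man page whether an offset option takes sectors or bytes.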
Q32: What is the sector-offset of the largest partition?
Q33: What is the byte-offset of the largest partition?
Q34: How many sectors long is the largest partition?
Q35: How many bytes long is the largest partition?
If you look at the .raw -file in a hex editor you should see the text NTFS near the start of
the largest partition.
Q36: How many (allocated) files are in the largest partition of the disk? Hint: use wc -l .
Note that to extract all (allocated) files it is faster to run tsk_recover . This might be useful
later.
2.4 Carving MFT entries
The scalpel tool can be used to do simple file carving. In /etc/scalpel/scalpel.conf you
can see an example configuration for carving various file types based on their first few bytes (header) and
last few bytes (footer).
Q38: Write a scalpel configuration file for carving only MFT entries.9
Note that scalpel can take a while to run, so you may want to leave this running while you do
other tasks further below.
Q39: How many MFT entries did you manage to carve?
Scalpel will output an audit.txt file that describes where each file was carved from.
Q40: Look at the MFT that was carved from byte-offset 3320834624 (it is easier to look at
the file you extracted, but you can also look directly in the disk file). What is the modification
timestamp of this MFT10 ?
In python you can parse a WINFILETIME 11 12 13 like this:
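A minimal sketch of such a conversion follows, assuming the usual definition of a Windows FILETIME as the number of 100-nanosecond intervals since 1601-01-01 UTC, stored on disk as an 8-byte little-endian integer. The function names are chosen for this example, and the sample value is the one that appears in the footnoted converter link.

```python
from datetime import datetime, timedelta, timezone

FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)


def filetime_to_datetime(filetime: int) -> datetime:
    """Convert 100-nanosecond ticks since 1601-01-01 UTC to a datetime."""
    # datetime has microsecond resolution, so the last decimal digit is dropped.
    return FILETIME_EPOCH + timedelta(microseconds=filetime // 10)


def filetime_from_bytes(raw: bytes) -> datetime:
    """Decode an 8-byte little-endian FILETIME as found in a hex dump."""
    return filetime_to_datetime(int.from_bytes(raw[:8], "little"))


if __name__ == "__main__":
    print(filetime_to_datetime(132775510287135457).isoformat())
```

The result is in UTC; converting to local time is a separate step that depends on the time zone of the machine being examined.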
Q41: Look at the MFT entry that was carved from byte-offset 3320834624. What is the
filename of this MFT entry?
Q42: Look at the MFT entry that was carved from byte-offset 3320834624. Is this a file or a
directory?
Check /etc/scalpel/scalpel.conf on how to carve jpg -files. Make sure your config only
carves jpg and nothing else (or scalpel will easily fill up all of your diskspace with junk).
Q43: How many JPEG-files did you find?
Note that scalpel can take a while to run, so you may want to leave this running while you do
other tasks further below.
9 Carrier, B. (2005). File system forensic analysis. Addison-Wesley Professional. Online: https://raw.githubusercontent.com/Urinx/Books/master/Forensic/File%20System%20Forensic%20Analysis.pdf
10 13Cubed MFT structure https://www.youtube.com/watch?v=l4IphrAjzeY
11 https://gist.github.com/kosh04/36cf6023fb75b516451ce933b9db2207
12 https://www.silisoftware.com/tools/date.php?inputdate=132775510287135457&inputformat=filetime
13 https://stackoverflow.com/a/6161842
Each MFT entry (basically a file) has a list of “attributes” 14 . An attribute is a sequence of data
and a description of where it is stored. Usually, the attribute will point somewhere on the disk,
but really small amounts of data will be stored inside the MFT entry itself. Each attribute type
has a number (e.g. 48) and a name (e.g. $FILE_NAME).
Some interesting MFT entry types include:
Note that the list of attributes in an MFT can contain multiple attributes of the same type
(which have to be combined to be interpreted). Each attribute also contains an ID that we can
use to distinguish it from other attributes of the same type. In sleuthkit we can reference files or
their attributes in the following way: MftNumber-AttributeTypeNumber-AttributeId .
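To make the addressing scheme concrete, the sketch below splits such an address into its parts and looks up the attribute type name. The addresses in the usage note are invented examples, the helper names are hypothetical, and the small lookup table lists only three well-known NTFS attribute types.

```python
from typing import NamedTuple, Optional

# A few well-known NTFS attribute type numbers and their names.
ATTRIBUTE_NAMES = {
    16: "$STANDARD_INFORMATION",
    48: "$FILE_NAME",
    128: "$DATA",
}


class MetaAddress(NamedTuple):
    mft_number: int
    attribute_type: Optional[int] = None
    attribute_id: Optional[int] = None


def parse_meta_address(text: str) -> MetaAddress:
    """Split 'MftNumber-AttributeTypeNumber-AttributeId' into integers."""
    parts = [int(part) for part in text.split("-")]
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"unexpected address format: {text!r}")
    return MetaAddress(*parts)
```

For example, parse_meta_address("73-128-1") yields MFT number 73, attribute type 128 and attribute ID 1, and ATTRIBUTE_NAMES[128] gives the name $DATA.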
Q44: In the folder C:\difo_resident on the disk, there are four files. Which of these use
resident $DATA attributes?
Q48: There was a connection from ?.?.?.?:X to 193.10.9.5:443. What was port X?
Q49: How many occurrences of the password dsvcs2 can you find in the blob? Hint:
strings
3.4 Zone.identifier
When you download something from the internet in Windows 10, Windows adds a special “alternate
data stream” named Zone.identifier to the downloaded file to indicate that it came from
the internet.
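The content of such a stream is a small INI-style text with a [ZoneTransfer] section. Once the stream has been extracted to a string, it can be read with Python's configparser, as sketched below. The sample text and its example.com URLs are invented for illustration, and the function name parse_zone_identifier is hypothetical.

```python
import configparser

# Invented sample; real streams may contain other or fewer keys.
SAMPLE = (
    "[ZoneTransfer]\n"
    "ZoneId=3\n"
    "ReferrerUrl=https://example.com/downloads/\n"
    "HostUrl=https://example.com/downloads/file.zip\n"
)


def parse_zone_identifier(text: str) -> dict:
    """Return the key/value pairs of the [ZoneTransfer] section, if present."""
    parser = configparser.ConfigParser(interpolation=None, delimiters=("=",))
    parser.optionxform = str  # keep the original key case, e.g. "HostUrl"
    parser.read_string(text)
    if not parser.has_section("ZoneTransfer"):
        return {}
    return dict(parser["ZoneTransfer"])


if __name__ == "__main__":
    for key, value in parse_zone_identifier(SAMPLE).items():
        print(f"{key}: {value}")
```

Interpolation is switched off because URLs often contain percent signs, which configparser would otherwise try to interpret.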
Q52: There are two files in the C:\difo_carve_me -directory. According to their respective
Zone-identifier-files, where were the files downloaded from?
information
Similarly, there is \Users\MYUSERNAME\AppData\Local\Microsoft\Windows\USRCLASS.dat , which
has a similar role to NTUSER.dat . Notably, it contains “shell bags” 18 , which are essentially the
settings that Windows Explorer keeps for each directory (window size, big icons vs small icons,
etc.). They are interesting because they contain timestamps of when a user was looking at that directory
using Explorer, so we can use them to try to prove that a person knew about the contents of that
directory.
Now, go to Tools > run ingestion module > brain.raw and run only the Recent Activity
ingestion module. You should see it start loading in the bottom right corner of the screen. Once
it is done, go to the Data Artifacts > Shell Bags view on the left-hand side of the screen.
Q55: When was the shell bag for C:\difo_gone_small last modified in local time?
3.7 Prefetch
If you did not already run the Recent Activity ingestion module, then do so now (only do it
once per case or weird things happen). Now go to Data Artifacts > Run programs (look for
Prefetch in the Comments field).
Q58: How many times has chrome.exe run, according to Windows Prefetch?
we will explore how to use VirtualBox to do such experiments. Use the following Windows VM:
brainvm.zip , but if you have a running Windows 10 system that is fine too.
A VirtualBox VM consists of a .vbox -textfile of settings and a .vdi -file (or other file format)
that represents the virtual hard disk of the VM. When you take a snapshot, VirtualBox will create
another .vdi -file (disk) and .sav -file (RAM and other stuff). So if we take a snapshot of a VM,
then copy all .vdi and .sav -files, then we have our acquisitions. If we want to be prim and
proper, we would save hashes for all those files to preserve them, but this is a bit overkill for our
experiments. Below is an example of what it might look like:
myvm/
    myvm.vbox  # metadata and settings
    myvm.vdi   # BIG original hard disk (not diff)
    Snapshots/
        {01a51ac7-33cc-4d98-b1a1-c11d7bb15b26}.vdi  # currently running snapshot
        2023-09-07T15-42-02-310924000Z.sav          # snapshot A RAM
        {01a51ac7-33cc-4d98-b1a1-c11d7bb15b26}.vdi  # snapshot A disk (diff)
        2023-09-07T16-28-23-410287000Z.sav          # snapshot B RAM
        {f8b691e8-3f27-4c1a-88e0-82d9cfdb2d60}.vdi  # snapshot B disk (diff)
DISK. To use these files, we must convert them into something our tools can analyze. Autopsy
claims to be able to read “VM image” files but they are lying. To get Autopsy to read a snapshot
disk, we must “flatten” it. You can do this in the VirtualBox GUI by going to Tools , clicking the
≡ symbol then going to Media . In here you should see the snapshot disks for all VMs with their
UUID-names. Right click the snapshot and choose copy to a new VMDK-file somewhere. This will
“flatten” the snapshot so that it has the whole disk. Once it is done copying, convert the vmdk-file
to raw:
RAM. Converting the .sav -file into something that volatility can understand
involves running the snapshot and then dumping RAM from the running VM. Since this is slow
(many extra steps), I recommend running dumpvmcore (see the full command below) just before
taking the snapshot, instead of later trying to convert the .sav -file.
If you forget the & you will need to run the second command in a separate terminal window
(alternatively, do CTRL+Z and then run bg to put the first command into the background).