Digital Investigation
Lorenz Liebler, Patrick Schmitt, Harald Baier, Frank Breitinger
journal homepage: www.elsevier.com/locate/diin
Keywords: Database lookup problem; Artifact lookup; Approximate matching; Carving

Abstract

In recent years different strategies have been proposed to handle the problem of ever-growing digital forensic databases. One concept to deal with this data overload is data reduction, which essentially means to separate the wheat from the chaff, e.g., to filter in forensically relevant data. Prominent techniques in the context of data reduction are hash-based solutions. Data reduction is achieved because hash values (of possibly large data input) are much smaller than the original input. Today's approaches for storing hash-based data fragments range from large-scale multithreaded databases to simple Bloom filter representations. One main focus was put on the field of approximate matching, where sorting is a problem due to the fuzzy nature of the approximate hashes. A crucial step during digital forensic analysis is to achieve fast query times during lookup (e.g., against a blacklist), especially in the scope of small or ordinary resource availability. However, a comparison of different database and lookup approaches is considerably hard, as most techniques differ in their considered use case and integrated features, respectively. In this work we discuss, reassess and extend three widespread lookup strategies suitable for storing hash-based fragments: (1) the hash database for hash-based carving (hashdb), (2) hierarchical Bloom filter trees (hbft) and (3) flat hash maps (fhmap). We outline the capabilities of the different approaches, integrate new extensions, discuss possible features and perform a detailed evaluation with a special focus on runtime efficiency. Our results reveal major advantages for fhmap in case of runtime performance and applicability. hbft showed a comparable runtime efficiency in case of lookups, but hbft suffers from pitfalls with respect to extensibility and maintenance. Finally, hashdb performs worst in case of a single core environment in all evaluation scenarios. However, hashdb is the only candidate which offers full parallelization capabilities, transactional features, and a single-level storage.
© 2019 The Author(s). Published by Elsevier Ltd on behalf of DFRWS. This is an open access article under
the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
https://doi.org/10.1016/j.diin.2019.01.020
mechanisms to handle common blocks, e.g., by integrating functions of filtration or deduplication. Those requirements influence the applicability of a specific lookup strategy. The consideration of common blocks also influences the results of higher-level analysis (e.g., approximate matching used for the task of identifying similar binaries or detecting shared libraries (Liebler and Breitinger, 2018; Pagani et al., 2018)).

In this work we discuss, reassess and extend three widespread lookup strategies suitable for storing hash-based fragments which have been proposed and are currently utilized in the field of digital forensics:

1. hashdb: In 2015 Garfinkel and McCarrin (2015) introduced hash-based carving, "a technique for detecting the presence of specific target files on digital media by evaluating the hashes of individual data blocks, rather than the hashes of entire files". Common blocks were identified as a problem and have to be handled or filtered out, as they are not suitable for identifying a specific file. To handle the sheer amount of digital artifacts and to perform fast and efficient queries, the authors utilized a so-called hashdb. The approach was integrated into the bulk_extractor forensic tool. Both implementations have been made publicly available.¹

2. hbft: In the scope of approximate matching, probabilistic data structures have been proposed to reduce the amount of memory needed for storing relevant artifacts. Approaches to store artifacts comprise multiple Bloom filters (Breitinger and Baier, 2012), single Bloom filters or more exotic Cuckoo filters (Fan et al., 2014; Gupta and Breitinger, 2015). One major problem of probabilistic data structures is the fact of losing the ability to actually identify a file. In 2014 Breitinger et al. (2014b) provided a theoretical concept of structured Bloom filter trees for identifying a file. In 2017, a more detailed discussion and concrete implementation was provided by Lillis et al. (2017). The approach is based on "the well-known divide and conquer paradigm and builds a Bloom filter-based tree data structure in order to enable an efficient lookup of similarity digests". This leads to hierarchical Bloom filter trees (hbft).

3. fhmap: Recently Malte Skarupke presented a fast hash table called flat_hash_map² (fhmap). The author claims that the implementation features the fastest lookups until now. A hash table features a constant lookup complexity of O(1) given a good hash function. The database implementation provides an interface for accessing the hash table itself; however, it does not feature any image slicing, chunk extraction or hashing. Thus, in order to utilize and evaluate fhmap in our context, it has to be extended by additional concepts to extract data fragments comparable to hash-based carving or fuzzy hashing.

Considering our depicted candidates and the overall goal of reassessing those in terms of capabilities and performance, the goals of this paper are as follows:

1. Assess the aforementioned proposed techniques for the task of fast artifact handling (i.e., hbft, fhmap and hashdb). Identify the capabilities of those techniques and the possible handling of common blocks.
2. Inspect the feasibility and discuss concepts to integrate the missing feature of multihit prevention (filtration of common blocks), similar to hashdb, into hbft or fhmap.
3. Discuss possible extensions of existing techniques in order to be able to compare the approaches.
4. Assess how the different approaches compete with respect to runtime performance and resource usage.

Our result is that fhmap is best in case of runtime performance and applicability. hbft showed a comparable runtime efficiency in case of lookups, but hbft suffers from pitfalls with respect to extensibility and maintenance. Finally, hashdb performs worst in case of a single core environment in all evaluation scenarios; however, it is the only candidate which offers full parallelization capabilities and transactional features.

The remainder of this work is structured as follows: In Section 'Candidates and feature analysis' we give a short introduction to our considered use case. In addition, an overview of the depicted evaluation candidates and their already integrated features is given. In Section 'Extensions to hbft and fhmap' we describe our proposed extensions and our evaluation of those. We give a detailed performance evaluation and discuss advantages and disadvantages of the different techniques in Section 'Evaluation'. Finally, we conclude this work.

Candidates and feature analysis

As all of the mentioned approaches strongly differ (either in their original use case or in their supported capabilities), we outline the motivation behind our choice of the depicted candidates. Therefore, we first describe the conditions of application and introduce the forensic use case which formalizes additional requirements and mandatory features (see Section 'Use case and requirements'). Afterwards, we explain our three candidates of choice in Section 'Depicted candidates'. Beside the required features of our considered use case, we need to discuss the already present features and capabilities of the different approaches. Thus, we outline the existing features and capabilities for each candidate in Section 'Feature analysis'.

Use case and requirements

In this work we address the problem of querying digital artifacts out of a large corpus of relevant digital artifacts. Sample applications are carving or approximate matching. Both applications suffer from the Database Lookup Problem, i.e., how to link an extracted artifact within a forensic investigation to a corresponding source of a forensic corpus efficiently (i.e., in terms of required storage capacity, required memory capacity or lookup performance). Beside those, our digital forensic scenarios bear additional pitfalls and challenges.

We consider the extraction of chunks (i.e., substrings) out of a raw bytestream, without the definition of any extraction process. A major challenge of matching an artifact to a source are occurring multihits, i.e., one chunk is linked to multiple files. This was first mentioned by Foster (2012). A multihit is also called a common block or a non-probative block. For instance, Microsoft Office documents such as Excel or Word documents share common byte blocks across different files (Garfinkel and McCarrin, 2015). Similar problems occur during the examination of executable binaries which have been statically linked (e.g., share a large amount of common code or data). Summarized, multi matches are a challenge for identifying an unknown fragment with full confidence. In addition, storing multihits also increases memory requirements and decreases lookup performance.
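To make these requirements concrete, the following minimal C++ interface sketches the operations each lookup strategy has to provide in our use case (the naming is ours and purely illustrative; none of the three candidates exposes exactly this API):

```cpp
#include <cstdint>
#include <vector>

// Illustrative interface only: chunks are hashed to fixed-size keys, inserted
// together with a file identifier and later queried. A key that maps to more
// than one file is a multihit (common block).
struct ArtifactStore {
    virtual void insert(uint64_t chunk_hash, uint32_t file_id) = 0;
    virtual std::vector<uint32_t> lookup(uint64_t chunk_hash) const = 0;
    // Optional capabilities discussed in the feature analysis below.
    virtual bool remove(uint64_t chunk_hash) = 0;   // not realizable for hbft
    virtual void deduplicate() = 0;                 // drop all multihits
    virtual ~ArtifactStore() = default;
};
```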
1 https://github.com/simsong/ (last accessed 2018-10-23).
2 You Can Do Better than std::unordered_map: New and Recent Improvements to Hash Table Performance, presented by Malte Skarupke at C++Now in 2018; https://probablydance.com/2018/05/28/a-new-fast-hash-table-in-response-to-googles-new-fast-hash-table/ (last accessed 2018-10-23).
Multihits can either be identified during the construction phase of the database (e.g., by deduplication or filtration) or during the lookup phase. By filtration of common blocks during the construction phase, the overall database load gets reduced and the lookup speed is increased as only unique hits are considered. Two different strategies of multihit prevention during the construction phase were proposed. First, as introduced by Garfinkel and McCarrin (2015), rules are defined to filter out known blocks with a high occurrence (and thus a low identification probability for an individual file). Such an approach requires extensive pre-analysis of the input set and its given structures. A second approach is the filtration of common blocks during construction by the additional integration of a deduplication step. Beside hashdb, none of our candidates provide deduplication or multihit prevention techniques so far. We refer to Garfinkel and McCarrin (2015) for further details and solely focus on the utilized database in the following subsection.

Features of adding and deleting artifacts have the major benefit of not needing to re-generate the complete database every time a new artifact needs to be included. While deleting inputs may be less frequent, adding new items to an existing storage scheme seems obvious and indispensable. While fhmap and hashdb support adding and deleting hashes from their scheme, this feature is not yet available in the current prototype of hbft. In detail, adding new elements to an hbft is possible; however, the tree needs to be re-generated as soon as a critical point of unacceptable false positives is reached. The definition of buckets also limits the capabilities to add further files to the database. Losing the capability of deleting elements from a binary Bloom filter is the main reason why deletion is impossible to realize. Summarized, adding and deleting hashes from a database is a mandatory or optional feature, depending on the specific use case.

Depicted candidates

In this section we present three widespread lookup strategies suitable for storing hash-based fragments: (1) the hash database for hash-based carving (hashdb), (2) hierarchical Bloom filter trees (hbft) and (3) flat hash maps (fhmap).

LMDB/hashdb. To store the considered blocks, Garfinkel and McCarrin (2015) make use of hashdb, a database which provides fast hash value lookups. The idea of hashdb is based on Foster (2012) and Young et al. (2012). In 2018, the current version (3.1)³ introduces significant changes compared to the original version mentioned by Garfinkel and McCarrin (2015).

The former implementation of hashdb originally supported B-Trees. Those have been replaced by the Lightning Memory-Mapped Database (LMDB), which is a high-performance and fully transactional database (Chu, 2011). It is a key-value store based on B+ Trees with shared-memory features and copy-on-write semantics. The database is read-optimised and can handle large data sets. The technique originally focused on the reduction of cache layers by mapping the whole database entirely into memory. Direct access to the mapped memory is established by a single address space and by features of the operating system itself. Storages are considered as primary (RAM) or secondary (disk) storages. Data which is already loaded can be accessed without delay as the data is already referenced by a memory page. Accessing not-referenced data triggers a page fault. This in turn leads the operating system to load the data without the need of any explicit I/O calls. Summarised, the fundamental concept behind LMDB is a single-level store; the mapping is read-only and write operations are performed regularly. The read-only memory and the filesystem are kept coherent through a Unified Buffer Cache. The size is restricted by the virtual address space limits of the underlying architecture. As mentioned by Chu (2011), on a 64 bit architecture which supports 48 addressable bits, this leads to an upper bound of 128 TiB for the database (i.e., 47 bits out of 64 bits).
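To illustrate the single-level store idea, the following fragment shows a generic use of the plain LMDB C API for storing and retrieving a block hash (our own usage example, not code taken from hashdb; error handling omitted):

```cpp
#include <lmdb.h>
#include <cstdint>

int main() {
    MDB_env *env;
    mdb_env_create(&env);
    mdb_env_set_mapsize(env, 1UL << 30);         // reserve 1 GiB of address space
    mdb_env_open(env, "./blockdb", 0, 0664);     // directory must already exist

    uint64_t block_hash = 0x1122334455667788ULL; // e.g. a truncated MD5 of a block
    uint32_t file_id    = 42;

    // Write transaction: the pair is copied into the memory-mapped file.
    MDB_txn *txn;
    MDB_dbi dbi;
    mdb_txn_begin(env, nullptr, 0, &txn);
    mdb_dbi_open(txn, nullptr, 0, &dbi);
    MDB_val key{sizeof(block_hash), &block_hash};
    MDB_val val{sizeof(file_id), &file_id};
    mdb_put(txn, dbi, &key, &val, 0);
    mdb_txn_commit(txn);

    // Read transaction: mdb_get yields a pointer into the read-only mapping,
    // i.e. the lookup is served directly by the page cache (single-level store).
    MDB_txn *rtxn;
    MDB_val found;
    mdb_txn_begin(env, nullptr, MDB_RDONLY, &rtxn);
    mdb_get(rtxn, dbi, &key, &found);
    mdb_txn_abort(rtxn);

    mdb_env_close(env);
    return 0;
}
```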
hbft. The concept of hierarchical Bloom filter trees (hbft) is fairly new. This theoretical concept was introduced by Breitinger et al. (2014b) and later implemented by Lillis et al. (2017). The lookup differs from the approximate matching algorithm mrsh-v2, as hbft only focuses on fragments to identify potential buckets of files. A parameter named min_run describes how many consecutive chunk hashes need to be found to emit a match. A good recall rate was accomplished for min_run = 4. The tree structure is then traversed further if a queried file is considered a match in the root node. Each of the nodes is represented by a single Bloom filter which empowers to traverse the tree. A traditional pairwise comparison can be done at possible matching leaf nodes. For details of the actual traversing concept we refer to the original paper (Lillis et al., 2017).

Just like previous mrsh implementations, the lookup structure can be precomputed in advance. First, the tree is constructed with its necessary nodes. Then the database files are inserted. Thus, the time for construction can be neglected in the actual comparison phase. More precisely, the tree structure is represented as a space-efficient array where each position in the array points to a Bloom filter. The implementation uses a bottom-up construction which fills the tree from the leaf nodes to the root. The array representation does not store references to nodes, their children, or leaves explicitly. Every reference needs to be calculated depending on the index in the array. Efficient index calculations are only applicable for binary trees. The lookup complexity within the tree structure is O(log_x(n)), where x describes the degree of the tree and n is the file set size.
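The implicit array layout can be illustrated with the usual binary-heap indexing; the following sketch is our simplification (not the code of Lillis et al.) and descends only into subtrees whose Bloom filter reports a potential match:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct BloomFilter {
    std::vector<bool> bits;
    bool contains(uint64_t h) const {            // two derived probes as a stand-in
        return !bits.empty() && bits[h % bits.size()] && bits[(h >> 17) % bits.size()];
    }
};

// Complete binary tree stored in one flat array: the children of node i sit at
// 2*i+1 and 2*i+2, so no explicit child pointers are required.
struct Hbft {
    std::vector<BloomFilter> nodes;              // nodes[0] is the root filter

    // Returns the indices of leaf filters (file buckets) that may hold the hash.
    std::vector<size_t> query(uint64_t chunk_hash) const {
        std::vector<size_t> hits;
        if (nodes.empty()) return hits;
        std::vector<size_t> stack{0};
        while (!stack.empty()) {
            size_t i = stack.back();
            stack.pop_back();
            if (!nodes[i].contains(chunk_hash))
                continue;                        // prune the whole subtree
            size_t left = 2 * i + 1, right = 2 * i + 2;
            if (left >= nodes.size())
                hits.push_back(i);               // leaf node: candidate bucket
            else {
                stack.push_back(left);
                if (right < nodes.size()) stack.push_back(right);
            }
        }
        return hits;
    }
};
```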
fhmap. Flat hash maps have been introduced as fast and easy to realize lookup strategies. Up to now, they have mainly been discussed in different fields of application. Similar to hbft, the actual implementation of fhmap represents a proof-of-concept implementation with good capabilities but limited features.

The concept of flat hash maps is an array of buckets which contain multiple entries. Each entry consists of a key-value pair. The key part represents the identifier for the value and is usually unique in the table. The index of the bucket is determined by a hash function and a modulo operation. The position i equals hash(key) mod size(table). A large amount of inserts into a small table causes collisions, where multiple items are inserted into the same bucket. A proper hash function needs to be chosen in order to maintain a lookup complexity of O(1). The function needs to spread the entries without clustering. The amount of inserted items is denoted by the load factor, i.e., the ratio of entries per bucket. A high load factor obviously causes more collisions. The table gets slower since the buckets have to be traversed to find the correct entry. The lower the load factor, the faster the table. However, more memory is required since buckets will be left empty on purpose. If a slot is full, the entries are re-arranged.

The whole table is implemented as a contiguous (flat) array without buckets, which allows fast lookups in memory. With linear probing, the next entry in the table is checked to see whether it is free. If not, the next one is checked until either a free slot is found or the upper probing limit is reached. The table is re-sized as soon as a defined limit is reached. The default load factor of this table is 0.5. Specific features should speed up the lookup phase: open addressing, linear probing, Robin Hood hashing, a prime number of slots and an upper probe count limit. Robin Hood hashing, introduced by Celis et al. (1985), ensures that most of the elements are close to their ideal entry in the table.

3 http://downloads.digitalcorpora.org/downloads/hashdb/hashdb_um.pdf (last accessed 2018-10-23).
The algorithm rearranges entries: elements which are very far will be positioned closer to their original slot, even if it is occupied by another element. The element which occupies this specific slot will also be rearranged from its possibly ideal slot to enable approximately equal distances for each element to its ideal position in the table. The algorithm takes slots from rich elements, which are close to their ideal slot, and gives those slots to the poor elements, which are very far away; hence the name.
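The interplay of linear probing and the Robin Hood rule can be condensed into a toy table (our own simplification, not Skarupke's flat_hash_map): entries live in one flat array, and on every collision the probe distances of the incoming and the resident entry are compared so that no element drifts much further from its ideal slot than its neighbours.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

// Toy open-addressing table with linear probing and Robin Hood insertion.
// Fixed capacity (> 0), no resizing or deletion; the caller must keep the
// load factor below 1. Just enough to show the probing behaviour.
class FlatMap {
    struct Slot { uint64_t key = 0; uint32_t value = 0; bool used = false; };
    std::vector<Slot> slots;

    size_t ideal(uint64_t key) const { return key % slots.size(); }
    size_t dist(uint64_t key, size_t pos) const {         // distance from ideal slot
        return (pos + slots.size() - ideal(key)) % slots.size();
    }

public:
    explicit FlatMap(size_t capacity) : slots(capacity) {}

    void insert(uint64_t key, uint32_t value) {
        size_t pos = ideal(key);
        size_t d = 0;
        while (true) {
            if (!slots[pos].used) {                        // free slot: place entry
                slots[pos] = {key, value, true};
                return;
            }
            if (slots[pos].key == key) {                   // overwrite existing key
                slots[pos].value = value;
                return;
            }
            // Robin Hood: if the resident entry is closer to its ideal slot than
            // we are to ours, it is "rich"; take its slot and keep moving it.
            size_t resident_d = dist(slots[pos].key, pos);
            if (resident_d < d) {
                std::swap(key, slots[pos].key);
                std::swap(value, slots[pos].value);
                d = resident_d;
            }
            pos = (pos + 1) % slots.size();
            ++d;
        }
    }

    std::optional<uint32_t> find(uint64_t key) const {
        size_t pos = ideal(key);
        size_t d = 0;
        while (slots[pos].used && d < slots.size()) {
            if (slots[pos].key == key) return slots[pos].value;
            // Robin Hood invariant: once we are further from our ideal slot than
            // the resident entry is from its own, the key cannot be in the table.
            if (dist(slots[pos].key, pos) < d) return std::nullopt;
            pos = (pos + 1) % slots.size();
            ++d;
        }
        return std::nullopt;
    }
};
```

With a load factor of 0.5, as in fhmap's default, probe sequences stay short and the early-exit check in find() keeps unsuccessful lookups cheap.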
Feature analysis

In what follows we shortly inspect the capabilities and properties of all considered techniques. We discuss the current state of each approach with respect to existing features and properties. Table 1 provides a summary of the discussed capabilities. Note, the marks (*) and (^) mean that these attributes are introduced or discussed, respectively, in the course of this work. For instance, an important extension in the case of fhmap is the integration of an appropriate chunk extraction and insertion technique.

Table 1
Features of hashdb, hbft and fhmap. New implementations are marked with an asterisk (symbol *, table cell is coloured green) and potential techniques are marked with a caret (symbol ^, table cell is coloured red).

Block Building. In case of hashdb, the database building and scanning of images is now possible without the use of bulk_extractor, which was originally proposed to extract chunks. It builds and hashes the blocks with a fixed sliding window which shifts along a fixed step size s. Obviously this produces quite a lot of block hashes to be stored in the database. Similar to the original mrsh-v2 algorithm, the current hbft implementation identifies chunks by the usage of a Pseudo-Random Function (PRF). As soon as the current byte input triggers a previously defined modulus, a new chunk boundary is defined. The current implementation sticks to the originally proposed rolling_hash (Breitinger and Baier, 2012). Thus, the extraction of chunks relies on the current context of an input sequence and not on a previously defined block size. Such Context-Triggered Piecewise Hashing (CTPH) algorithms prevent issues caused by changing starting offsets of an input sequence. The fixed defined modulus bm approximates the extracted block size on average. Unlike hashdb or hbft, the fhmap implementation itself obviously does not feature any block building or hashing. As it will just serve as a container, we have to extend its capabilities to extract, hash and store fragments during evaluation.
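The context-triggered extraction can be sketched as follows. The window hash below is a simple stand-in chosen by us, not the exact rolling_hash of Breitinger and Baier (2012), but it shows the principle: a boundary is emitted whenever the hash of the last few bytes hits a fixed trigger value, so boundaries depend only on the local content and not on absolute offsets.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Content-defined chunking sketch: returns the end offsets of all chunks found
// in `data`. A boundary is declared whenever the rolling hash over a small
// window hits a trigger value, so the expected chunk size is roughly `modulus`.
// The remainder after the last boundary forms a final, shorter chunk.
std::vector<size_t> chunk_boundaries(const std::vector<uint8_t>& data,
                                     uint64_t modulus = 512) {
    constexpr size_t   kWindow = 7;    // bytes contributing to the rolling state
    constexpr uint64_t kBase   = 257;  // multiplier of the polynomial window hash

    uint64_t base_pow = 1;             // kBase^kWindow, used to drop the oldest byte
    for (size_t i = 0; i < kWindow; ++i) base_pow *= kBase;

    std::vector<size_t> boundaries;
    uint64_t hash = 0;
    for (size_t i = 0; i < data.size(); ++i) {
        hash = hash * kBase + data[i];                    // slide in the new byte
        if (i >= kWindow)
            hash -= base_pow * data[i - kWindow];         // slide out the old byte
        if (i + 1 >= kWindow && hash % modulus == modulus - 1)
            boundaries.push_back(i + 1);                  // chunk ends after byte i
    }
    return boundaries;
}
```

Because the hash depends only on the last kWindow bytes, two scans that start at different offsets produce identical boundaries once both have consumed a full window; this re-synchronisation property is what the parallelized extraction described later relies on.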
Block-Hashing Algorithm. In case of hashdb, the authors propose the cryptographic hash function MD5 and an initial block size b of 4 KiB, a value which is obviously inspired by the common cluster size of today's file systems. hbft makes use of the FNV hash function to hash its chunks and sets 5 bits in its leaf nodes and the corresponding ancestor nodes. Since the root Bloom filter is considerably large, the adapted version of FNV which outputs 256 bits is required. Finally, fhmap makes use of FNV-1 for hashing an input.

Multithreading support. With respect to multithreading support, hashdb (or the Lightning Memory-Mapped Database) allows multithreaded reading operations to improve the runtime performance. However, up to now no theoretical or practical concepts are available to integrate multithreading support into hbft and fhmap, respectively. Nevertheless, the block building phase may use multithreading for both hbft and fhmap.

Multihit handling. hashdb associates hashed blocks with meta data. The meta data describes the count of matching files for a specific block. The saved counts are used to remove duplicates from a database and to rule out multi matches in advance. In detail, all hashed blocks are removed which have an associated counter value higher than one (Garfinkel and McCarrin, 2015). The current prototypes of both hbft and fhmap do not provide any functionality of deduplication. Thus, a feature for filtering common blocks and common shared chunks is missing. We address this problem in Section 3 and extend both prototypes.
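This counter-based removal can be pictured in a few lines (a schematic re-implementation of ours, not hashdb code): count the distinct files contributing each block hash and keep only the hashes that stem from exactly one file.

```cpp
#include <cstdint>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>

// Keep only block hashes that occur in exactly one file (count == 1).
std::unordered_map<uint64_t, uint32_t>
deduplicate(const std::vector<std::pair<uint64_t, uint32_t>>& block_hashes) {
    // block_hashes: (hash, file_id) pairs produced during the build phase.
    std::unordered_map<uint64_t, uint32_t> file_of;   // hash -> owning file
    std::unordered_set<uint64_t> multihits;           // hashes seen in >1 file

    for (const auto& [hash, file_id] : block_hashes) {
        auto it = file_of.find(hash);
        if (it == file_of.end())
            file_of.emplace(hash, file_id);
        else if (it->second != file_id)
            multihits.insert(hash);                   // common block, mark for removal
    }
    for (uint64_t h : multihits) file_of.erase(h);    // drop all multihits
    return file_of;                                   // unique hits only
}
```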
Add and remove hashes. hashdb supports adding new hashes into a database (with integrated deduplication). First the new files are split into blocks and hashed. Afterwards, the hashes are inserted into the database. The implementation also supports deletion of hashes from a given database by subtracting a database from the original. In order to insert new files into an existing hbft database, the tree needs to be rebuilt with the new file hashes if the tree was dimensioned for the original file set. One can save the original block hashes in order to avoid rehashing. The current concept of the structure does not provide any functionality for deleting a given chunk hash. This is primarily caused by the nature of Bloom filters. The original implementation of fhmap supports adding and deleting entries by default. After determining the index in the table, the entries are re-arranged as soon as a slot is full. After adding or deleting, the table is optionally re-sized to a final load factor of 0.5.

Prefiltering of non-matches. hashdb provides prechecking by a Hash Store. A Hash Store is described as a highly compressed optimized store of all block hashes in the database.⁴ In case of hbft, the root Bloom filter provides an easy discrimination between a match and a non-match and thus yields prefiltering of non-matches. The current version of fhmap does not provide any prechecking or prefiltering mechanisms. The additional implementation of a Bloom filter may solve this problem and is subject to future research.

False Positives. Similar to a probabilistic lookup strategy like

4 http://downloads.digitalcorpora.org/downloads/hashdb/hashdb_um.pdf (last accessed 2018-10-23).
5 https://github.com/ishnid/mrsh-hbft.
6 https://github.com/skarupke/flat_hash_map; further extensions will be available via https://dasec.h-da.de/staff/lorenz-liebler/ in April 2019.

Fig. 2. Global-filter based prevention of multihits. Three different Bloom filters are used to filter multihits: a local Bloom filter (BF_L), a global multihit Bloom filter (BF_M), and a global Bloom filter (BF_G).
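As background for the filter combination named in Fig. 2, a minimal Bloom filter can be sketched as follows (illustrative code of ours; two derived index probes stand in for the k independent hash functions a real implementation would derive, e.g., from FNV output):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal Bloom filter over 64 bit chunk hashes: add() sets a few bit positions,
// contains() checks them and may therefore report false positives, never false
// negatives. Filters like BF_L, BF_M and BF_G in Fig. 2 are instances of this idea.
class Bloom {
    std::vector<bool> bits;
public:
    explicit Bloom(size_t m) : bits(m, false) {}

    void add(uint64_t h) {
        bits[h % bits.size()] = true;
        bits[(h >> 17) % bits.size()] = true;
    }
    bool contains(uint64_t h) const {
        return bits[h % bits.size()] && bits[(h >> 17) % bits.size()];
    }
};
```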
7 Performed on a laptop with an Intel i7 2.2 GHz, 8 GiB of RAM and an SSD.
8 http://roussev.net/t5/t5.html (last accessed 2018-10-23).

Fig. 4. Introducing wrong blocks at the borders of image blocks.
rolling_hash function which subsequently defines the chunk boundaries within each block. Thus, each P_i contains 0 or more chunks C_i. The challenge arises at each chunk boundary, i.e., the last chunk in P_i and the first in P_{i+1} (marked red). The chunk boundaries are implicitly defined by the block boundaries and do not match the original chunk boundaries of S. The naive alignment of blocks, which would set the start address of a subsequent block to the end address of the preceding block (i.e., s_{i+1} = e_i), would process chunks at the borders incorrectly (e.g., compared to non-parallelized processing of S). To produce consistent chunk boundaries in the parallelized and non-parallelized version, we could move the starting point into the range of a preceding block. This would create an overlap which gives the rolling_hash a chance to re-synchronize. We could also run over the end e_i until we identify a match with the leading chunks of its successor (i.e., the block processed by P_{i+1}).

Evaluation rolling_hash. Our prototype implementation uses the producer-consumer paradigm. The image is read as a stream of bytes by the main thread. Depending on the amount of cores, the first image_size/cpu_num bytes are read. Those bytes are passed to a thread which performs the rolling hash algorithm and hashes the identified blocks. In parallel, the next (image_size/cpu_num) + overlap bytes are read and processed by another thread. Table 2 shows the times needed to process a 2 GiB image into block hashes with and without threading. In the case of multithreading, the time needed for resynchronization is already included. Obviously the needed CPU time will increase since it is spread over multiple threads. The actual elapsed time is remarkably lower. A speedup of approximately 30 seconds can be achieved. This increases the processing speed by a factor of 3.2 while keeping consistent block boundaries. The query could be parallelized as well, but this was not implemented in the course of this work.

Table 2
Times of building/hashing blocks of a 2 GiB image in seconds (s).

       Singlethread    Multithread (8 threads)
Real   43.82 s         13.59 s
CPU    35.87 s         49.25 s
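A minimal sketch of this producer/consumer split (our illustration, not the prototype's code; it reuses the hypothetical chunk_boundaries() function from the earlier chunking sketch): every worker scans its slice of the image plus a small leading overlap so that the window hash is warmed up by the time it reaches its own bytes, and boundaries found during the warm-up are discarded because the preceding worker already reports them.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Reuses chunk_boundaries() from the earlier sketch (window size 7 bytes).
// Boundaries are content-defined, so each worker reproduces exactly the
// boundaries of a sequential scan once its rolling hash has re-synchronized.
std::vector<size_t> parallel_boundaries(const std::vector<uint8_t>& image,
                                        unsigned workers, size_t overlap = 64) {
    const size_t slice = image.size() / workers + 1;
    std::vector<std::vector<size_t>> partial(workers);
    std::vector<std::thread> pool;

    for (unsigned w = 0; w < workers; ++w) {
        pool.emplace_back([&, w] {
            size_t begin = w * slice;
            if (begin >= image.size()) return;
            size_t warmup = std::min(begin, overlap);     // reach into the predecessor
            size_t end = std::min(begin + slice, image.size());
            std::vector<uint8_t> piece(image.begin() + (begin - warmup),
                                       image.begin() + end);
            for (size_t b : chunk_boundaries(piece))
                if (b > warmup)                           // keep only own boundaries
                    partial[w].push_back(begin - warmup + b);
        });
    }
    for (auto& t : pool) t.join();

    std::vector<size_t> all;                              // worker ranges are disjoint
    for (auto& p : partial)                               // and already in ascending order
        all.insert(all.end(), p.begin(), p.end());
    return all;
}
```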
Theoretical extensions

drawback that hbft structures need to be partially rebuilt without the deleted file. In order to delete a specific file from the data structure, all related nodes from a leaf node up to the root node are affected. Such affected filters need to be deleted and re-populated again. Also, file chunks which are represented by the affected nodes need to be re-inserted again. Concerning the depicted tree structure in Fig. 5, lastly, every file hash is needed, since the root filter holds the block hashes for every file in the set. Splitting the root node into several filters would reduce the amount of hashes recreated from scratch. During a lookup, this would require to horizontally process a sequence of root filters first. Fig. 5 describes the problem of deleting files in an hbft structure. Deleting a file from the tree comes with considerable effort and computational overhead.

Fig. 5. Deletion of elements in an hbft could be performed in two steps. First: delete all Bloom filters which were influenced by the deleted file in a top-down or bottom-up approach. Second: re-insert block hashes affected by the deletion in a bottom-up approach.

Evaluation

The following performance tests focus on a runtime comparison between hashdb, hbft and fhmap. Each phase, ranging from creating a database to the actual lookup, is measured individually. Beside the overall Memory Consumption (4.2), we consider three major phases: Build Phase (4.3), Deduplication Phase (4.4) and Lookup Phase (4.5).

The assessment of required resources and performance limitations of the candidates should respect the proposed environmental conditions. In particular, the considered techniques are scaled for specific environments, where hashdb explicitly targets large scale systems with multiprocessing capabilities. Even if the presence of an adequate infrastructure is a considerable assumption, we aim for similar evaluation conditions and therefore will limit resources (e.g. the number of processing cores). Again, it should be clear that hashdb as a single-level store clearly stands out compared to our memory-only candidates hbft and fhmap. However, we strive for a comprehensive comparison in our introduced use case by including a fully equipped database with desirable features.
Memory consumption

The three approaches feature different memory requirements. After ingesting the 2 GiB test set, hashdb produces files totaling 405.9 MB on disk. Since there is no theoretical background to calculate the data structure's size in main memory, it is approximated using the top command. The in-memory size of the structure is about 900 MB.

In the case of fhmap, the author mentions some storage overhead for the handling of key-value pairs. The overhead will be at a minimum of 8 bits per entry and will be padded to match the actual key length. Assuming a key length of 64 bit, the overhead for fhmap would be 64 bit as well. Assuming the test set of 2 GiB with a block size b = 512, the total amount of blocks n will be 4,194,304. The total size s in main memory with a load factor of 0.5 would then result in s = n · 64/0.5 bits · 3 ≈ 200 MiB. The allocation size on disk would be halved to approximately 100 MiB due to the load factor.
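Written out (under the stated assumption of three 64 bit words per entry, i.e., key, value and padded overhead, at a load factor of 0.5), the arithmetic behind this estimate is:

```latex
n = \frac{2\,\mathrm{GiB}}{512\,\mathrm{B}} = 4{,}194{,}304, \qquad
s \approx \frac{n \cdot 3 \cdot 64\,\mathrm{bit}}{0.5} = n \cdot 384\,\mathrm{bit} = 192\,\mathrm{MiB},
```

which matches the roughly 200 MiB stated above.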
The size s of an hbft depends on various parameters and is mainly influenced by the data set size, the block size b, and a desired false positive rate fpr. In our scenario we approximate the root filter size for the parameters data set size = 2 GiB, b = 512 and fpr = 10^-6. This would lead to an approximated root filter size of m_1 = m · 2^15.84 ≈ 14 MiB. The tree consists of log_2(4096) = 12 levels. Thus, the total amount of needed memory is approximately 12 · 14 MiB = 168 MiB. The size on disk will be approximately the same, since the array and the corresponding Bloom filters need to be saved. For further details of the hbft parametrization and calculation, we refer to Lillis et al. (2017).
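As a plausibility check (our own calculation with the standard Bloom filter sizing formula, not necessarily the parametrization of Lillis et al.), a root filter that has to hold all n = 4,194,304 chunk hashes at fpr = 10^-6 indeed ends up at roughly 14 MiB:

```latex
m_1 \;=\; -\frac{n \ln(\mathit{fpr})}{(\ln 2)^2}
      \;=\; \frac{4{,}194{,}304 \cdot \ln\!\left(10^{6}\right)}{(\ln 2)^2}
      \;\approx\; 1.2 \cdot 10^{8}\,\mathrm{bit} \;\approx\; 14\,\mathrm{MiB}.
```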
An overview of the required storage for each technique can be

Deduplication phase

Recalling the utilized set of random data, randomly generated data does not feature any multihits and thus nearly all of the extracted chunks result in unique hits. Our considered version of hashdb allows deduplication of an existing database per configuration. The associated counter values of all hashed blocks are checked and all blocks with a value higher than one are deleted. The remaining chunks are written to a new database, which finally ensures unique matches during lookup. The size of the newly established database stays the same. The multihit mechanisms of hbft and fhmap introduced and implemented in Section 'Extensions to hbft and fhmap' are executed in memory only. The runtime results of the deduplication phase are displayed in Fig. 7a and b. Timings do not differ remarkably for single- or multi-threaded scenarios. The deduplication procedure for hashdb is slightly slower since it needs to read the database from disk first. As the rolling hash produces fewer blocks than a fixed-size extraction, the

Lookup phase

The lookup consists of splitting an image into blocks, hashing those blocks, and querying them against the database. As we are only interested in efficiency (but not in detection performance), we make use of a simple approach to simulate full and partial detection scenarios. We create four different images which are queried against the databases. All images have a fixed size of 2 GiB. Each of the four images is constructed to match either 100%, 75%, 50%, or 25% of the database. Again we point out that the different file matching sizes are only used to investigate the efficiency behaviour dependent on different matching rates, i.e., the matching rate is the input parameter. Images with matching rates below 100% are partially filled with random bytes to reach the desired size of 2 GiB. The size of every inserted file is a multiple of b = 512 bytes. Thus, images are crafted which do not cause alignment issues for a fixed block extraction. It has to be considered that the rolling hash

Conclusion

In this work we discussed and evaluated three different implementations of artifact lookup strategies in the course of digital forensics. Several extensions have been proposed to finally perform a comprehensive performance evaluation of hashdb, hbft, and fhmap. We introduced concepts to handle multihits for hbft and fhmap by the implementation of deduplication and filtration features. Moreover, we interfaced fhmap with a rolling hash based extraction of chunks. For a better comparison to hashdb, we additionally parallelized the extraction of chunks.

Results show that fhmap outperforms hbft in most of the considered performance evaluations. While hbfts are faster than hashdb in nearly all evaluations, the concept introduces false positives by the utilized Bloom filters. Even if hbfts have small advantages in case of memory and storage efficiency, their complexity, fixed parametrization, and limited scope of features make such an advantage negligible. However, specific use cases with tight memory constraints could make hbfts still valuable.

Discussions of hashdb in terms of performance should consider the underlying concept of single-level stores. Shifting the discussion to offered features and a long-term usage with ongoing maintenance, hashdb and fhmap are more suitable. One thing to note is that hashdb is the only implementation that is able to deal with databases which do not fit into main memory. In addition, it supports transactional features.

In Table 4 a final comparison of all three candidates in terms of performance and supported features is given. The final overview underlines the trade-offs between the concepts, where fhmap shows a constant performance in most of the mentioned categories.

Table 4
Final comparison of hashdb/hbft/fhmap in case of performance and offered features.
Future work

Concepts to close the gap between performance-oriented memory-resident lookup strategies and transactional databases are needed.

Acknowledgment

This work was supported by the German Federal Ministry of Education and Research (BMBF) as well as by the Hessen State Ministry of Higher Education, Research and the Arts within CRISP (www.crisp-da.de).
References

Breitinger, F., Baier, H., 2012. Similarity preserving hashing: eligible properties and a new algorithm mrsh-v2. In: International Conference on Digital Forensics and Cyber Crime. Springer, pp. 167-182.
Breitinger, F., Baier, H., White, D., 2014. On the database lookup problem of approximate matching. Digit. Invest. 11, Supplement 1, S1-S9. Proceedings of the First Annual DFRWS Europe.
Breitinger, F., Rathgeb, C., Baier, H., 2014. An efficient similarity digests database lookup - a logarithmic divide & conquer approach. The Journal of Digital Forensics, Security and Law: JDFSL 9, 155.
Celis, P., Larson, P.-A., Munro, J.I., 1985. Robin Hood hashing. In: 26th Annual Symposium on Foundations of Computer Science. IEEE, pp. 281-288.
Chu, H., 2011. MDB: a memory-mapped database and backend for OpenLDAP. In: Proceedings of the 3rd International Conference on LDAP, Heidelberg, Germany, p. 35.
Fan, B., Andersen, D.G., Kaminsky, M., Mitzenmacher, M.D., 2014. Cuckoo filter: practically better than Bloom. In: Proceedings of the 10th ACM International Conference on Emerging Networking Experiments and Technologies. ACM, pp. 75-88.
Foster, K., 2012. Using Distinct Sectors in Media Sampling and Full Media Analysis to Detect Presence of Documents from a Corpus. Technical Report. Naval Postgraduate School, Monterey, CA.
Garfinkel, S.L., McCarrin, M., 2015. Hash-based carving: searching media for complete files and file fragments with sector hashing and hashdb. Digit. Invest. 14, S95-S105.
Gupta, V., Breitinger, F., 2015. How cuckoo filter can improve existing approximate matching techniques. In: International Conference on Digital Forensics and Cyber Crime. Springer, pp. 39-52.
Harichandran, V.S., Breitinger, F., Baggili, I., 2016. Bytewise approximate matching: the good, the bad, and the unknown. Journal of Digital Forensics, Security and Law 11, 4.
Liebler, L., Breitinger, F., 2018. mrsh-mem: approximate matching on raw memory dumps. In: International Conference on IT Security Incident Management and IT Forensics. IEEE, pp. 47-64.
Lillis, D., Breitinger, F., Scanlon, M., 2017. Expediting mrsh-v2 approximate matching with hierarchical Bloom filter trees. In: International Conference on Digital Forensics and Cyber Crime. Springer, pp. 144-157.
Pagani, F., Dell'Amico, M., Balzarotti, D., 2018. Beyond precision and recall: understanding uses (and misuses) of similarity hashes in binary analysis. In: Proceedings of the Eighth ACM Conference on Data and Application Security and Privacy. ACM, pp. 354-365.
Young, J., Foster, K., Garfinkel, S., Fairbanks, K., 2012. Distinct sector hashes for target file detection. Computer 45, 28-35.