
Digital Investigation 28 (2019) S116–S125

On efficiency of artifact lookup strategies in digital forensics

Lorenz Liebler a, b, *, Patrick Schmitt c, Harald Baier a, b, Frank Breitinger d

a da/sec Biometrics and Internet Security Research Group, Hochschule Darmstadt, Darmstadt, Germany
b CRISP, Center for Research in Security and Privacy, Darmstadt, Germany
c Secure Software Engineering Group, Technische Universität Darmstadt, Darmstadt, Germany
d Cyber Forensics Research and Education Group (UNHcFREG), University of New Haven, New Haven, USA

* Corresponding author: da/sec Biometrics and Internet Security Research Group, Hochschule Darmstadt, Darmstadt, Germany. E-mail address: [email protected] (L. Liebler).

Keywords: Database lookup problem; Artifact lookup; Approximate matching; Carving

Abstract

In recent years, different strategies have been proposed to handle the problem of ever-growing digital forensic databases. One concept to deal with this data overload is data reduction, which essentially means separating the wheat from the chaff, e.g., filtering in forensically relevant data. A prominent technique in the context of data reduction is hash-based solutions: data reduction is achieved because hash values (of possibly large data input) are much smaller than the original input. Today's approaches for storing hash-based data fragments range from large-scale multithreaded databases to simple Bloom filter representations. One main focus has been the field of approximate matching, where sorting is a problem due to the fuzzy nature of the approximate hashes. A crucial step during digital forensic analysis is to achieve fast query times during lookup (e.g., against a blacklist), especially in the scope of small or ordinary resource availability. However, a comparison of different database and lookup approaches is considerably hard, as most techniques differ in their considered use case and integrated features. In this work we discuss, reassess and extend three widespread lookup strategies suitable for storing hash-based fragments: (1) hash database for hash-based carving (hashdb), (2) hierarchical Bloom filter trees (hbft) and (3) flat hash maps (fhmap). We outline the capabilities of the different approaches, integrate new extensions, discuss possible features and perform a detailed evaluation with a special focus on runtime efficiency. Our results reveal major advantages for fhmap in terms of runtime performance and applicability. hbft showed a comparable runtime efficiency for lookups, but suffers from pitfalls with respect to extensibility and maintenance. Finally, hashdb performs worst in a single-core environment in all evaluation scenarios. However, hashdb is the only candidate which offers full parallelization capabilities, transactional features, and a single-level storage.

© 2019 The Author(s). Published by Elsevier Ltd on behalf of DFRWS. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). https://doi.org/10.1016/j.diin.2019.01.020

Introduction

Approximate matching (a.k.a. fuzzy hashing or similarity hashing) is a common concept across the digital forensic community to perform known file/block identification in order to cope with the large amounts of data. However, due to the fuzzy nature of approximate hashes, current approaches suffer from the Database Lookup Problem (Breitinger et al., 2014a). This problem is based on the decision whether a given fingerprint is a member of the reference dataset. The general database lookup problem is of complexity O(n) in the size of the reference set for each query and hence quickly becomes prohibitive as the number of queries grows. To address this problem, different techniques have been discussed, such as multiple Bloom filters, single large Bloom filters, Cuckoo filters or hierarchical Bloom filter trees (Harichandran et al., 2016; Lillis et al., 2017).

Besides the complexity, an investigator has to deal with common blocks, which make the identification of the correct match hard (Garfinkel and McCarrin, 2015). The extraction and correct assignment of a specific data fragment is of crucial importance, i.e., the identified chunks allow the inference of the original source (e.g., a potentially malicious file or a media file). Specifically, different files of the same type or application often share a non-negligible amount of common blocks, e.g., file structure elements in the file header. This leads to multihits; hence, those blocks or chunks are not suitable for a unique identification of a specific file. To avoid this problem, lookup strategies should consider additional mechanisms to handle common blocks, e.g., by integrating functions of filtration or deduplication. Those requirements influence the applicability of a specific lookup strategy. The consideration of common blocks also influences the results of higher-level analysis (e.g., approximate matching used for the task of identifying similar binaries or detecting shared libraries (Liebler and Breitinger, 2018; Pagani et al., 2018)).
In this work we discuss, reassess and extend three widespread lookup strategies suitable for storing hash-based fragments which have been proposed and are currently utilized in the field of digital forensics:

1. hashdb: In 2015 Garfinkel and McCarrin (2015) introduced hash-based carving, "a technique for detecting the presence of specific target files on digital media by evaluating the hashes of individual data blocks, rather than the hashes of entire files". Common blocks were identified as a problem and have to be handled or filtered out, as they are not suitable for identifying a specific file. To handle the sheer amount of digital artifacts and to perform fast and efficient queries, the authors utilized a so-called hashdb. The approach was integrated into the bulk_extractor forensic tool. Both implementations have been made publicly available (https://github.com/simsong/, last accessed 2018-10-23).
2. hbft: In the scope of approximate matching, probabilistic data structures have been proposed to reduce the amount of memory needed for storing relevant artifacts. Approaches to store artifacts comprise multiple Bloom filters (Breitinger and Baier, 2012), single Bloom filters or more exotic Cuckoo filters (Fan et al., 2014; Gupta and Breitinger, 2015). One major problem of probabilistic data structures is the loss of the ability to actually identify a file. In 2014 Breitinger et al. (2014b) provided a theoretical concept of structured Bloom filter trees for identifying a file. In 2017, a more detailed discussion and concrete implementation was provided by Lillis et al. (2017). The approach is based on "the well-known divide and conquer paradigm and builds a Bloom filter-based tree data structure in order to enable an efficient lookup of similarity digests". This leads to hierarchical Bloom filter trees (hbft).
3. fhmap: Recently Malte Skarupke presented a fast hash table called flat_hash_map (fhmap; see "You Can Do Better than std::unordered_map: New and Recent Improvements to Hash Table Performance", presented at C++Now in 2018, https://probablydance.com/2018/05/28/a-new-fast-hash-table-in-response-to-googles-new-fast-hash-table/, last accessed 2018-10-23). The author claims that the implementation features the fastest lookups so far. A hash table features a constant lookup complexity of O(1) given a good hash function. The implementation provides an interface for accessing the hash table itself; however, it does not feature any image slicing, chunk extraction or hashing. Thus, in order to utilize and evaluate fhmap in our context, it has to be extended by additional concepts to extract data fragments comparable to hash-based carving or fuzzy hashing.

Considering our depicted candidates and the overall goal of reassessing those in terms of capabilities and performance, the goals of this paper are as follows:

1. Assess the aforementioned techniques for the task of fast artifact handling (i.e., hbft, fhmap and hashdb). Identify the capabilities of those techniques and the possible handling of common blocks.
2. Inspect the feasibility and discuss concepts to integrate the missing feature of multihit prevention (filtration of common blocks), similar to hashdb, into hbft or fhmap.
3. Discuss possible extensions of existing techniques in order to be able to compare the approaches.
4. Assess how the different approaches compete with respect to runtime performance and resource usage.

Our result is that fhmap is best in terms of runtime performance and applicability. hbft showed a comparable runtime efficiency for lookups, but suffers from pitfalls with respect to extensibility and maintenance. Finally, hashdb performs worst in a single-core environment in all evaluation scenarios; however, it is the only candidate which offers full parallelization capabilities and transactional features.

The remainder of this work is structured as follows: In Section 'Candidates and feature analysis' we give a short introduction to our considered use case. In addition, an overview of the depicted evaluation candidates and their already integrated features is given. In Section 'Extensions to hbft and fhmap' we describe our proposed extensions and our evaluation of those. We give a detailed performance evaluation and discuss advantages and disadvantages of the different techniques in Section 'Evaluation'. Finally, we conclude this work.

Candidates and feature analysis

As all of the mentioned approaches strongly differ (either in their original use case or in their supported capabilities), we outline the motivation behind our choice of the depicted candidates. Therefore, we first describe the conditions of application and introduce the forensic use case which formalizes additional requirements and mandatory features (see Section 'Use case and requirements'). Afterwards, we explain our three candidates of choice in Section 'Depicted candidates'. Beside the required features of our considered use case, we need to discuss the already present features and capabilities of the different approaches. Thus, we outline the existing features and capabilities for each candidate in Section 'Feature analysis'.

Use case and requirements

In this work we address the problem of querying digital artifacts out of a large corpus of relevant digital artifacts. Sample applications are carving or approximate matching. Both applications suffer from the Database Lookup Problem, i.e., how to link an extracted artifact within a forensic investigation to a corresponding source of a forensic corpus efficiently (i.e., in terms of required storage capacity, required memory capacity or lookup performance). Beside those, our digital forensic scenarios bear additional pitfalls and challenges.

We consider the extraction of chunks (i.e., substrings) out of a raw bytestream, without the definition of any extraction process. A major challenge of matching an artifact to a source are occurring multihits, i.e., one chunk is linked to multiple files. This was first mentioned by Foster (2012). A multihit is also called a common block or a non-probative block. For instance, Microsoft Office documents such as Excel or Word documents share common byte blocks across different files (Garfinkel and McCarrin, 2015). Similar problems occur during the examination of executable binaries which have been statically linked (e.g., share a large amount of common code or data). Summarized, multi matches are a challenge for identifying an unknown fragment with full confidence. In addition, storing multihits also increases memory requirements and decreases lookup performance.

Multihits can either be identified during the construction phase of the database (e.g., by deduplication or filtration) or during the lookup phase. By filtering common blocks during a construction phase, the overall database load is reduced and the lookup speed is increased, as only unique hits are considered. Two different strategies of multihit prevention during the construction phase were proposed. First, as introduced by Garfinkel and McCarrin (2015), rules are defined to filter out known blocks with a high occurrence (and thus a low identification probability for an individual file). Such an approach requires extensive pre-analysis of the input set and its given structures. A second approach is the filtration of common blocks during construction by the additional integration of a deduplication step. Beside hashdb, none of our candidates provide deduplication or multihit prevention techniques so far. We refer to Garfinkel and McCarrin (2015) for further details and solely focus on the utilized database in the following subsection.

Features of adding and deleting artifacts have the major benefit of not needing to re-generate the complete database every time a new artifact needs to be included. While deleting inputs may be less frequent, adding new items to an existing storage scheme seems obvious and indispensable. While fhmap and hashdb support adding and deleting hashes from their scheme, this feature is not yet available in the current prototype of hbft. In detail, adding new elements to a hbft is possible; however, the tree needs to be re-generated as soon as a critical point of unacceptable false positives is reached. The definition of buckets also limits the capability to add further files to the database. Losing the capability of deleting elements out of a binary Bloom filter is the main reason why features of deletion are impossible to realize. Summarized, adding and deleting hashes from a database is a mandatory or optional feature, depending on the specific use case.

Depicted candidates

In this section we present three widespread lookup strategies suitable for storing hash-based fragments: (1) hash database for hash-based carving (hashdb), (2) hierarchical Bloom filter trees (hbft) and (3) flat hash maps (fhmap).

LMDB/hashdb. To store the considered blocks, Garfinkel and McCarrin (2015) make use of hashdb, a database which provides fast hash value lookups. The idea of hashdb is based on Foster (2012) and Young et al. (2012). In 2018, the current version (3.13) introduces significant changes compared to the original version mentioned by Garfinkel and McCarrin (2015).
The former implementation of hashdb originally supported B-trees. Those have been replaced by the Lightning Memory-Mapped Database (LMDB), which is a high-performance and fully transactional database (Chu, 2011). It is a key-value store based on B+ trees with shared-memory features and copy-on-write semantics. The database is read-optimised and can handle large data sets. The technique originally focused on the reduction of cache layers by mapping the whole database entirely into memory. Direct access to the mapped memory is established by a single address space and by features of the operating system itself. Storages are considered as primary (RAM) or secondary (disk) storages. Data which is already loaded can be accessed without a delay as the data is already referenced by a memory page. Accessing not-referenced data triggers a page fault. This in turn leads the operating system to load the data without the need for any explicit I/O calls. Summarised, the fundamental concept behind LMDB is a single-level store: the mapping is read-only and write operations are performed regularly. The read-only memory and the filesystem are kept coherent through a Unified Buffer Cache. The size is restricted by the virtual address space limits of the underlying architecture. As mentioned by Chu (2011), on a 64 bit architecture which supports 48 addressable bits, this leads to an upper bound of 128 TiB for the database (i.e., 47 bits out of 64 bits).
hbft. The concept of hierarchical Bloom filter trees (hbft) is fairly new. The theoretical concept was introduced by Breitinger et al. (2014b) and later implemented by Lillis et al. (2017). The lookup differs from the approximate matching algorithm mrsh-v2, as hbft only focuses on fragments to identify potential buckets of files. A parameter named min_run describes how many consecutive chunk hashes need to be found to emit a match. A good recall rate was accomplished for min_run = 4. The tree structure is then traversed further if a queried file is considered a match in the root node. Each of the nodes is represented by a single Bloom filter, which empowers traversing the tree. A traditional pairwise comparison can be done at possible matching leaf nodes. For details of the actual traversing concept we refer to the original paper (Lillis et al., 2017).

Just like previous mrsh implementations, the lookup structure can be precomputed in advance. First, the tree is constructed with its necessary nodes. Then the database files are inserted. Thus, the time for construction can be neglected in the actual comparison phase. More precisely, the tree structure is represented as a space-efficient array where each position in the array points to a Bloom filter. The implementation uses a bottom-up construction which fills trees from the leaf nodes to the root. The array representation does not store references to nodes, their children, or leaves explicitly. Every reference needs to be calculated depending on the index in the array. Efficient index calculations are only applicable for binary trees. The lookup complexity within the tree structure is O(log_x(n)), where x describes the degree of the tree and n is the file set size.
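As an illustration of the traversal idea (not the authors' implementation), the following sketch builds a small array-based binary tree of Bloom filters and only descends into children whose filters report a hit. The SHA-256-based double hashing, filter sizes and class names are our own simplifications; the real hbft uses FNV-based hashing and carefully dimensioned filters.

```python
# Sketch: hierarchical Bloom filter tree lookup (binary tree, array layout).
import hashlib

class Bloom:
    def __init__(self, m_bits=1 << 16, k=5):
        self.m, self.k, self.bits = m_bits, k, bytearray(m_bits // 8)
    def _positions(self, item: bytes):
        h = hashlib.sha256(item).digest()
        h1, h2 = int.from_bytes(h[:8], "big"), int.from_bytes(h[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]
    def add(self, item: bytes):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)
    def __contains__(self, item: bytes):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

class HBFT:
    """Array-based binary tree; each leaf represents one file (bucket)."""
    def __init__(self, num_leaves: int):
        self.leaves = num_leaves
        self.nodes = [Bloom() for _ in range(2 * num_leaves - 1)]
    def insert(self, leaf_index: int, chunk_hash: bytes):
        i = self.leaves - 1 + leaf_index       # array index of the leaf
        while True:                            # propagate up to the root filter
            self.nodes[i].add(chunk_hash)
            if i == 0:
                break
            i = (i - 1) // 2
    def lookup(self, chunk_hash: bytes):
        """Return leaf indices whose whole path of filters contains the chunk."""
        hits, stack = [], [0]
        while stack:
            i = stack.pop()
            if chunk_hash not in self.nodes[i]:
                continue                       # root/inner filter rules out this subtree
            if i >= self.leaves - 1:
                hits.append(i - (self.leaves - 1))
            else:
                stack.extend((2 * i + 1, 2 * i + 2))
        return hits
```

The root-filter check at index 0 is what provides the cheap prefiltering of non-matches discussed later; only chunks passing the root descend further, and a pairwise comparison can then be performed for the returned leaf candidates.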
fhmap. Flat hash maps have been introduced as fast and easy-to-realize lookup strategies. Up to now, they have mainly been discussed in different fields of application. Similar to hbft, the actual implementation of fhmap represents a proof-of-concept implementation with good capabilities but limited features.

The concept of flat hash maps is an array of buckets which contain multiple entries. Each entry consists of a key-value pair. The key part represents the identifier for the value and is usually unique in the table. The index of the bucket is determined by a hash function and a modulo operation; the position i equals hash(key) mod size(table). A large amount of inserts into a small table causes collisions, where multiple items are inserted into the same bucket. A proper hash function needs to be chosen in order to maintain a lookup complexity of O(1). The function needs to spread the entries without clustering. The amount of inserted items is denoted by the load factor, i.e., the ratio of entries per bucket. A high load factor obviously causes more collisions. The table gets slower since the buckets have to be traversed to find the correct entry. The lower the load factor, the faster the table. However, more memory is required since buckets will be left empty on purpose. If a slot is full, the entries are re-arranged.

The whole table is implemented as a contiguous (flat) array without buckets, which allows fast lookups in memory. With linear probing, the next entry in the table is checked if it is free. If not, the next one is checked until either a free slot is found or the upper probing limit is reached. The table is re-sized as soon as a defined limit is reached. The default load factor of this table is 0.5. Specific features should speed up the lookup phase: open addressing, linear probing, Robin Hood hashing, a prime number of slots and an upper limit on the probe count. Robin Hood hashing, introduced by Celis et al. (1985), ensures that most of the elements are close to their ideal entry in the table. The algorithm rearranges entries: elements which are very far away will be positioned closer to their original slot, even if it is occupied by another element. The element which occupies this specific slot will also be rearranged from its possibly ideal slot to enable approximately equal distances for each element to its ideal position in the table. The algorithm takes slots from rich elements, which are close to their ideal slot, and gives those slots to the poor elements, which are very far away; hence the name.
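The core mechanics can be sketched in a few lines. The following toy table is not Skarupke's flat_hash_map; it is a minimal open-addressing map with linear probing and a 0.5 load factor, without Robin Hood displacement, prime-sized tables or an upper probe limit.

```python
# Sketch: open-addressing hash table with linear probing in a flat array.
class FlatMap:
    def __init__(self, capacity=8):
        self.slots = [None] * capacity      # each slot: None or (key, value)
        self.used = 0

    def _probe(self, key):
        i = hash(key) % len(self.slots)     # position = hash(key) mod size(table)
        while True:
            slot = self.slots[i]
            if slot is None or slot[0] == key:
                return i                    # free slot or the key itself
            i = (i + 1) % len(self.slots)   # linear probing

    def put(self, key, value):
        if (self.used + 1) * 2 > len(self.slots):   # keep load factor <= 0.5
            self._grow()
        i = self._probe(key)
        if self.slots[i] is None:
            self.used += 1
        self.slots[i] = (key, value)

    def get(self, key, default=None):
        slot = self.slots[self._probe(key)]
        return slot[1] if slot else default

    def _grow(self):
        old = [s for s in self.slots if s]
        self.slots = [None] * (len(self.slots) * 2)
        self.used = 0
        for k, v in old:
            self.put(k, v)
```

Deletion (which the original flat_hash_map supports) is omitted here; with open addressing it requires tombstones or backward-shift deletion so that probe chains remain intact.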
Feature analysis

In what follows we shortly inspect the capabilities and properties of all considered techniques. We discuss the current state of each approach with respect to existing features and properties. Table 1 provides a summary of the discussed capabilities. Note that the marks (*) and (^) mean that these attributes are introduced or discussed, respectively, in the course of this work. For instance, an important extension in the case of fhmap is the integration of an appropriate chunk extraction and insertion technique.

Table 1. Features of hashdb, hbft and fhmap. New implementations are marked with an asterisk (symbol *, table cell coloured green) and potential techniques are marked with a caret (symbol ^, table cell coloured red).

Block Building. In case of hashdb, the database building and scanning of images is now possible without the use of bulk_extractor, which was originally proposed to extract chunks. It builds and hashes the blocks with a fixed sliding window which shifts along a fixed step size s. Obviously this produces quite a lot of block hashes to be stored in the database. Similar to the original mrsh-v2 algorithm, the current hbft implementation identifies chunks by the usage of a Pseudo-Random Function (PRF). As soon as the current byte input triggers a previously defined modulus, a new chunk boundary is defined. The current implementation sticks to the originally proposed rolling_hash (Breitinger and Baier, 2012). Thus, the extraction of chunks relies on the current context of an input sequence and not on a previously defined block size. Those Context-Triggered Piecewise-Hashing (CTPH) algorithms prevent issues caused by changing starting offsets of an input sequence. The fixed defined modulus b approximates the extracted block size on average. Unlike hashdb or hbft, the fhmap implementation itself obviously does not feature any block building or hashing. As it will just serve as a container, we have to extend its capabilities to extract, hash and store fragments during evaluation.

Block-Hashing Algorithm. In case of hashdb, the authors propose the cryptographic hash function MD5 and an initial block size b of 4 KiB, a value which is obviously inspired by the common cluster size of today's file systems. hbft makes use of the FNV hash function to hash its chunks and sets 5 bits in its leaf nodes and the corresponding ancestor nodes. Since the root Bloom filter is considerably large, the adapted version of FNV which outputs 256 bits is required. Finally, fhmap makes use of FNV-1 for hashing an input.
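For reference, a compact 64-bit FNV-1 (and, for comparison, FNV-1a) in Python; the 256-bit variant required for the large hbft root filter follows the same scheme with a larger prime and offset basis. The constants below are the published 64-bit FNV parameters.

```python
# 64-bit FNV-1 and FNV-1a (the chunk-hash family used by hbft and fhmap).
FNV64_PRIME = 0x100000001b3
FNV64_OFFSET = 0xcbf29ce484222325
MASK64 = (1 << 64) - 1

def fnv1_64(data: bytes) -> int:
    h = FNV64_OFFSET
    for byte in data:
        h = (h * FNV64_PRIME) & MASK64   # FNV-1: multiply first ...
        h ^= byte                        # ... then XOR in the next byte
    return h

def fnv1a_64(data: bytes) -> int:
    h = FNV64_OFFSET
    for byte in data:
        h ^= byte                        # FNV-1a: XOR first ...
        h = (h * FNV64_PRIME) & MASK64   # ... then multiply
    return h
```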
Multithreading support. With respect to multithreading support, hashdb (or the Lightning Memory-Mapped Database) allows multithreaded reading operations to improve the runtime performance. However, up to now no theoretical or practical concepts are available to integrate multithreading support into hbft and fhmap, respectively. Nevertheless, the block building phase may use multithreading for both hbft and fhmap.

Multihit handling. hashdb associates hashed blocks with metadata. The metadata describes the count of matching files for a specific block. The saved counts are used to remove duplicates from a database and to rule out multi matches in advance. In detail, all hashed blocks are removed which have an associated counter value higher than one (Garfinkel and McCarrin, 2015). The current prototypes of both hbft and fhmap do not provide any functionality of deduplication. Thus, a feature for filtering common blocks and common shared chunks is missing. We address this problem in Section 'Extensions to hbft and fhmap' and extend both prototypes.

Add and remove hashes. hashdb supports adding new hashes into a database (with integrated deduplication). First, the new files are split into blocks and hashed. Afterwards, the hashes are inserted into the database. The implementation also supports deletion of hashes from a given database by subtracting one database from another. In order to insert new files into an existing hbft database, the tree needs to be rebuilt with the new file hashes if the tree was limited to the original file set. One can save the original block hashes in order to avoid rehashing. The current concept of the structure does not provide any functionality for deleting a given chunk hash. This is primarily caused by the nature of Bloom filters. The original implementation of fhmap supports adding and deleting entries by default. After determining the index in the table, the entries are re-arranged as soon as a slot is full. After adding or deleting, the table is optionally re-sized to a final load factor of 0.5.

Prefiltering of non-matches. hashdb provides prechecking by a Hash Store. A Hash Store is described as a highly compressed, optimized store of all block hashes in the database (see the hashdb user manual, http://downloads.digitalcorpora.org/downloads/hashdb/hashdb_um.pdf, last accessed 2018-10-23). In case of hbft, the root Bloom filter provides an easy discrimination between a match and a non-match and thus yields prefiltering of non-matches. The current version of fhmap does not provide any prechecking or prefiltering mechanisms. The additional implementation of a Bloom filter may solve this problem and is subject of future research.

False Positives. Similar to a probabilistic lookup strategy like Bloom filters, the currently proposed Hash Store of hashdb causes false positives. Even if the concept of prechecking produces false positives, hashdb still performs a complete lookup of the queried hash value. Thus, the overall lookup does not suffer from any false positives. A major disadvantage of utilizing a probabilistic lookup strategy as in the case of hbft is the possible collision of lookups. Thus, the lookup strategy suffers from false positives. The expected value of false positives is controlled by the size of the root Bloom filter and the handled amount of inserts. As fhmap performs a full lookup on the stored hash values, the approach does not suffer from any possible false positives.

Limited to RAM. The current implementation of hashdb integrates capabilities of loading and storing entries from and to disk. Thus, the approach is not limited by any memory boundaries. The overall design and construction of the hbft tree heavily relies on the memory constraints. The original implementation was created as a RAM-resident solution only. The proposed parametrization and initialization mainly focuses on memory boundaries of a target system. In case of fhmap, the structure of the contiguous array is directly created in memory and thus the current approach is limited to the given memory boundaries of the system.

Persistent Database. The database of hashdb is persistent. The current implementation of hbft offers the possibility to save and load a database to and from disk. The recent prototype of fhmap was only proposed as a simple proof of concept. No features for saving or restoring a disk-based database have been considered so far.
Extensions to hbft and fhmap

In this section we discuss extensions of our candidates hbft and fhmap with respect to both already implemented and potential future ones. The extended code is available for both hbft (https://github.com/ishnid/mrsh-hbft) and fhmap (https://github.com/skarupke/flat_hash_map; further extensions will be available via https://dasec.h-da.de/staff/lorenz-liebler/ in April 2019). In contrast, hashdb already fulfills most of the required capabilities and is the only ready-to-use approach for our considered context. In Section 'Multihit prevention hbft' we first discuss strategies for the handling of multihits in the case of hbft, benchmark them, and depict a proper candidate. In Section 'Multihit prevention fhmap' we discuss the possible multihit prevention via deduplication for fhmap. The integration of persistence for fhmap will not be outlined in detail in the course of this work. In Section 'Chunk extraction hbft and fhmap' we discuss the chunk extraction process in the case of hbft and fhmap. We will additionally introduce a concept for the parallelization of the chunk extraction via a rolling hash function. Finally, we will outline some theoretical extensions and thoughts in Section 'Theoretical extensions'.

Multihit prevention hbft

The prevention of multihits (i.e., the filtration of common blocks) can take place at the construction or at the lookup phase. In the following we discuss two approaches of multihit prevention, one realized during the construction phase and one during the lookup phase.
Tree-filter based. By the utilization of a temporary hbft for each file, multihit chunks can be marked during the construction phase. Each file is processed sequentially, one after another. As can be seen in Fig. 1, a temporary hbft stores the chunks of the currently selected file. The following files are compared against the temporary hbft. A multihit is highlighted within the tree by a counter. The currently compared chunk of a processed file is also labeled with a counter. After processing all of the subsequent files, unique chunks of the current tree are saved into a global hbft. The next file generates a temporary hbft again. However, only chunks with a zero counter are considered during the pass. The final tree stores unique chunks in different leaf nodes. Thus, it can be guaranteed that the tree does not feature the same chunk in a different leaf node. Obviously, this approach has a higher building time but offers additional features. A possible advantage of utilizing counters for each chunk could be the definition of a threshold for accepted multihits. Thus, by the definition of a threshold, a tree is generated which stores the maximum amount of elements in each leaf node. This empowers identifying chunks related to multiple files. Considering an interleaved multihit between two unique hits, an investigator could infer the gap between both.

Fig. 1. Tree-filter based multihit prevention with temporary hbfts per file. Temporary Bloom filter trees (TBF_i) are used to filter multihits for a current file F_i and all its subsequent files. A global hbft (GBF) represents all unique elements in the processed set.

Global-filter based. A straightforward approach could be the utilization of a separate Bloom filter which represents all multihits for a target file set. Therefore, two Bloom filters are generated with an adequate size (i.e., a size which respects an upper bound of false positives).

As shown in Fig. 2, a single global Bloom filter stores all chunks of a file set. A second global multihit Bloom filter stores all multihits for a corresponding set. A temporary local Bloom filter is generated for a specific file and gets zeroed out before another file is processed. The local filter empowers distinguishing multihits within a file itself. Recalling the informal definition of a multihit, a multihit within a file itself but with no matches in other files can still be used for unique identification and is desired to keep. The local filter emits possible multihits on a per-file basis. Those are ignored and not further processed in the global filter. If a chunk is neither in the local filter nor in the global filter, it is inserted into the global filter. We consider such a chunk as a unique chunk until proven otherwise. If a chunk is already in the global filter, an identical chunk has been seen before. Such a chunk is further considered as a multihit and gets inserted into the multihit filter. This process is repeated for each file. The result is a global filter which stores all occurred multihits. A set (including multihits) can be stored in a global hbft with an additional check against the multihit filter. An additional step of deduplication could shift the prevention of multihits to the construction phase.

Fig. 2. Global-filter based prevention of multihits. Three different Bloom filters are used to filter multihits: a local Bloom filter (BF_L), a global multihit Bloom filter (BF_M), and a global Bloom filter (BF_G).
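A minimal sketch of the three-filter bookkeeping described above, using Python sets as exact stand-ins for the Bloom filters BF_L, BF_M and BF_G (a real implementation would trade exactness for space); chunk hashes are assumed to be supplied per file by an external chunker.

```python
# Sketch: global-filter based multihit detection.
# Sets replace BF_L (local), BF_G (global) and BF_M (multihit) Bloom filters.
def find_multihits(files):
    """files: iterable of (filename, iterable of chunk hashes)."""
    bf_g, bf_m = set(), set()          # global filter, multihit filter
    for _, chunks in files:
        bf_l = set()                   # local filter, zeroed out per file
        for chunk in chunks:
            if chunk in bf_l:
                continue               # multihit within the same file: keep it
            bf_l.add(chunk)
            if chunk in bf_g:
                bf_m.add(chunk)        # seen in another file before: multihit
            else:
                bf_g.add(chunk)        # unique until proven otherwise
    return bf_m                        # chunks to filter out before insertion
```

During construction, a chunk is then only inserted into the final hbft (or fhmap) if it is not contained in the returned multihit filter.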
Evaluation and selection. We benchmarked the two introduced approaches with respect to construction and lookup runtime (performed on a laptop with an Intel i7 2.2 GHz, 8 GiB of RAM and an SSD). For our benchmark we make use of the t5-corpus (http://roussev.net/t5/t5.html, last accessed 2018-10-23) and each node in the tree represents one file of the corpus. The corpus consists of 4457 files with a total size of approximately 1.9 GiB. The interface was extended to additionally handle a single large image as input. We constructed an image by concatenating all 4457 files of the set. The extraction of chunks by the utilization of the rolling_hash with a block size b = 160 produces 8,311,785 chunks. In total, 457,793 of the chunks are multihits (i.e., shared in more than one file).

Fig. 3 outlines the results of the benchmark. The construction phase does not differ for the file or image lookup. The prevention is handled during the construction phase, as we focus on an improved lookup performance. The deduplication handling causes longer construction times in both cases. The tree-filter approach clearly exceeds the global-filter approach in terms of construction time. Depicting an appropriate candidate (i.e., Tree-filter or Global-filter) is a trade-off between performance and matching capabilities. By the utilization of a global filter, we lose the ability to match multihits to their root files.

Fig. 3. Benchmark of multihit prevention approaches for hbft.

Lookup timings are identical for both techniques as both approaches filter multihits before the construction of the tree. Once the search for a file hits a leaf node, it can be assumed that the query will not match any other leaf nodes. This holds true aside from the false positives caused by the probabilistic Bloom filters. Even if it depends on the overall use case how often a filter has to be rebuilt from scratch, the time needed to construct a Tree-filter becomes unbearable for larger data sets. Therefore, this enhancement will not be pursued any further and we propose the usage of a Global-filter.

Multihit prevention fhmap

Originally and naturally, hash tables do not feature our considered handling of multihits by design. Each of the inserted keys should be unique. Implementations behave differently upon multihit insertions. Some databases will simply overwrite existing values, while others won't insert the value at all as soon as a key is already occupied. However, there are hash tables which support duplicated keys. This section presents an algorithm which prevents multihit insertions in any hash table.

Each inserted chunk is represented by a hash value, i.e., the actual key. The value of the corresponding key is a reference to the filename the chunk originates from. We assume that the corresponding chunks are multihits if two hash values are identical. In order to rule them out in the final database, each chunk is looked up first. If a key is not in the table, it can be safely inserted. If a key is already present in the table, it is very likely a multihit. As already explained, multihits which occur in a file itself, but not in any other, should be kept. If the found value of the key (i.e., the filename) is equal to the value of the key which needs to be inserted, it is a multihit within a file itself. The insertion is ignored since the chunk is already represented in the table. If the found value is not equal to the query value, the chunk is a multihit within the file set. The entry is marked as a duplicate in the table. This procedure is done for all chunks in the file set.

In the second processing step, each entry is checked for the duplicate mark. If a mark is found, the entry is deleted from the hash table. This algorithm also reduces the amount of inserted chunks while keeping unique insertions only. Fewer insertions also mean fewer collisions and an increase of lookup speed. We will inspect the runtime, also in terms of deduplication, in the following section. Keeping the multihits is possible and would be comparable to the concept of hashdb, which keeps internal counters for inserted chunks. However, this would force the database to handle multihits during the lookup phase.
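The two-pass procedure can be sketched with an ordinary dictionary standing in for the hash table; the DUPLICATE marker value is an illustrative device of ours, not part of the original prototype.

```python
# Sketch: two-pass multihit prevention for a hash table (chunk hash -> filename).
DUPLICATE = object()   # illustrative marker for entries to be purged

def build_unique_table(files):
    """files: iterable of (filename, iterable of chunk hashes)."""
    table = {}
    # Pass 1: insert, and mark keys that occur in more than one file.
    for name, chunks in files:
        for chunk in chunks:
            found = table.get(chunk)
            if found is None:
                table[chunk] = name          # not present: safe to insert
            elif found == name:
                pass                         # multihit within the same file: keep
            else:
                table[chunk] = DUPLICATE     # multihit across files: mark
    # Pass 2: purge everything that was marked as a duplicate.
    for chunk in [c for c, v in table.items() if v is DUPLICATE]:
        del table[chunk]
    return table
```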
Chunk extraction hbft and fhmap

Chunk extraction means the following. A given input sequence of bytes is divided into chunks by the definition of a fixed modulus b (common values are 64 ≤ b ≤ 320 bytes). The extraction algorithm iterates over the input stream in a sliding-window fashion, rolls through the sequence byte by byte, and processes 7 consecutive bytes at a time. The current window is hashed with a rolling_hash function, which returns a value between 0 and b. If this value hits the value b − 1, a trigger point is found, which defines the boundary of the current chunk.
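The following sketch mirrors that behaviour with a simplified stand-in for the rolling hash (a polynomial hash over the 7-byte window instead of the original rolling_hash of Breitinger and Baier (2012)); the window size and the trigger condition follow the description above. A production rolling hash updates the window hash incrementally instead of recomputing it, which is what makes byte-by-byte rolling cheap.

```python
# Sketch: context-triggered chunking over a byte stream.
WINDOW = 7

def window_hash(window: bytes, b: int) -> int:
    h = 0
    for byte in window:
        h = (h * 31 + byte) & 0xFFFFFFFF   # cheap polynomial hash (illustrative)
    return h % b

def extract_chunks(data: bytes, b: int = 160):
    """Yield (start, end) offsets of context-triggered chunks; average size ~b bytes."""
    start = 0
    for i in range(WINDOW, len(data) + 1):
        if window_hash(data[i - WINDOW:i], b) == b - 1:   # trigger point
            yield (start, i)
            start = i
    if start < len(data):
        yield (start, len(data))           # trailing chunk without a trigger
```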
We consider an example of block building and querying an image of 2 GiB. Reading in the image, constructing the block hashes via rolling_hash, and hashing the blocks via FNV-256 takes approximately 43 seconds. Querying each chunk against the hbft takes approximately 8.5 seconds (with each chunk present in the hbft). Thus, the extraction without lookup takes about 83.4% of the overall query time for a single process.

Beside the evaluation of lookup strategies, we also discuss the possible parallelization of the extraction process itself. A possible parallelized version of the rolling_hash is depicted in Fig. 4. We evenly split the input S into parts P_i which start at byte s_i (offset) and end at e_i (offset). The split blocks are further processed by a rolling_hash function which subsequently defines the chunk boundaries within each block. Thus, each P_i contains 0 or more chunks C_i. The challenge arises at each chunk boundary, i.e., the last chunk in P_i and the first in P_{i+1} (marked red in the figure). The chunk boundaries are implicitly defined by the block boundaries and do not match the original chunk boundaries of S. The naive alignment of blocks, which would set the start address of a block to the end address of the preceding block (i.e., s_{i+1} = e_i), would process chunks at the borders incorrectly (e.g., compared to non-parallelized processing of S). To produce consistent chunk boundaries in the parallelized and non-parallelized version, we could move the starting point into the range of a preceding block. This would create an overlap which gives the rolling_hash a chance to re-synchronize. We could also run over the end e_i until we identify a match with the leading chunks of its successor (i.e., the block processed as P_{i+1}).

Fig. 4. Introducing wrong blocks at the borders of image blocks.
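Continuing the toy chunker from the sketch above (it reuses WINDOW and window_hash), the following sketch splits the input into parts and gives each worker a few bytes of overlap from the preceding part. Because the stand-in hash only looks at a 7-byte window, this small overlap is already enough to reproduce exactly the boundaries of a sequential run; the producer/consumer I/O pipeline of the actual prototype is omitted, and all names are ours.

```python
# Sketch: parallel chunk-boundary detection with overlapping parts.
from concurrent.futures import ProcessPoolExecutor

def triggers_in_part(part: bytes, base: int, b: int):
    """Absolute trigger positions found inside one overlapped part."""
    out = []
    for i in range(WINDOW, len(part) + 1):
        if window_hash(part[i - WINDOW:i], b) == b - 1:
            out.append(base + i)
    return out

def parallel_chunks(data: bytes, b: int = 160, workers: int = 8):
    size = -(-len(data) // workers)                     # ceiling division
    jobs = []
    for w in range(workers):
        lo, hi = w * size, min((w + 1) * size, len(data))
        if lo >= hi:
            continue
        overlap = min(lo, WINDOW - 1)                   # borrow context from the left part
        jobs.append((data[lo - overlap:hi], lo - overlap, lo, hi))
    # Note: ProcessPoolExecutor needs the usual __main__ guard on some platforms.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(triggers_in_part, part, base, b)
                   for part, base, _, _ in jobs]
        cuts = sorted(pos
                      for (_, _, lo, hi), fut in zip(jobs, futures)
                      for pos in fut.result()
                      if lo < pos <= hi)                # keep only this worker's own range
    bounds = [0] + cuts
    if not cuts or cuts[-1] != len(data):
        bounds.append(len(data))
    return list(zip(bounds, bounds[1:]))                # same chunks as extract_chunks()
```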
comprehensive comparison in our introduced use case by including
Evaluation rolling_hash. Our prototype implementation uses the producer-consumer paradigm. The image is read as a stream of bytes by the main thread. Depending on the amount of cores, the first image_size/cpu_num bytes are read. Those bytes are passed to a thread which performs the rolling hash algorithm and hashes the identified blocks. In parallel, the next (image_size/cpu_num) + overlap bytes are read and processed by another thread. Table 2 shows the times needed to process a 2 GiB image into block hashes with and without threading. In the case of multithreading, the time needed for resynchronization is already included. Obviously the needed CPU time will increase since it is spread over multiple threads. The actual elapsed time is remarkably lower. A speedup of approximately 30 seconds can be achieved. This increases the processing speed by a factor of 3.2 while keeping consistent block boundaries. The query could be parallelized as well, but was not implemented in the course of this work.

Table 2. Times of building/hashing blocks of a 2 GiB image in seconds (s).

        Singlethread   Multithread (8 threads)
Real    43.82 s        13.59 s
CPU     35.87 s        49.25 s

Theoretical extensions

This subsection continues with an analysis of possible extensions which have not been integrated.

New file insertion in hbft. Adding new elements to a hbft is considerably easy, as long as the tree does not reach a critical load factor. A naive approach is the initial design of an oversized tree structure to create additional empty leaf nodes. A parameterization always has to consider the impact on the overall false positive rate. The new files can then be split into blocks and hashed into the tree. The already introduced Global-filter can be used to filter multihits with low dependencies. Therefore, only the original global and multihit Bloom filters have to be updated and saved after a finished session. With a further growing amount of additional files being inserted, the empty leaf node pool will starve and the false positive rate of the structure will become unacceptable. At this point the database needs to be resized and rebuilt. Original block hashes can be saved to disk to shorten the build time. However, in many cases this suggestion is infeasible. Adding additional files to the tree requires careful and storage-intensive pre-planning. Most of the time the database is optimized for a given file set. Introducing empty leaf nodes possibly adds additional levels to the tree. This pessimistic growth would slow down the lookup phase and add memory overhead which may never be required.

File deletion in hbft. Hashes cannot be deleted from Bloom filters (except Counting Bloom filters). This in turn leads to the major drawback that hbft structures need to be partially rebuilt without the deleted file. In order to delete a specific file from the data structure, all related nodes from a leaf node up to the final root node are affected. Such affected filters need to be deleted and re-populated again. Also, file chunks which are represented by the affected nodes need to be re-inserted again. Concerning the depicted tree structure in Fig. 5, lastly, every file hash is needed since the root filter holds the block hashes for every file in the set. Splitting the root node into several filters would reduce the amount of hashes recreated from scratch. During a lookup, this would require horizontally processing a sequence of root filters first. Fig. 5 describes the problem of deleting files in a hbft structure. Deleting a file from the tree comes with considerable effort and computational overhead.

Fig. 5. Deletion of elements in a hbft can be performed in two steps. First: delete all Bloom filters which were influenced by the deleted file in a top-down or bottom-up approach. Second: reinsert block hashes affected by the deletion in a bottom-up approach.
Evaluation

The following performance tests focus on a runtime comparison between hashdb, hbft and fhmap. Each phase, ranging from creating a database to the actual lookup, is measured individually. Beside the overall Memory Consumption, we consider three major phases: Build Phase, Deduplication Phase and Lookup Phase.

The assessment of required resources and performance limitations of the candidates should respect the proposed environmental conditions. In particular, the considered techniques are scaled for specific environments, where hashdb explicitly targets large-scale systems with multiprocessing capabilities. Even if the presence of an adequate infrastructure is a considerable assumption, we aim for similar evaluation conditions and therefore limit resources (e.g., the number of processing cores). Again, it should be clear that hashdb as a single-level store clearly stands out compared to our memory-only candidates hbft and fhmap. However, we strive for a comprehensive comparison in our introduced use case by including a fully equipped database with desirable features.

Testsystem and testdata

Testsystem. All of the tests were performed on a laptop with Ubuntu 16.04 LTS using an underlying ext4 filesystem. The machine features an Intel i7 processor with 2.2 GHz, 8 GB RAM, a built-in HDD, and a built-in SSD. After each evaluation run the memory and caches have been cleared in order to avoid runtime or storage benefits in a subsequent evaluation pass (i.e., we make use of a 'cold' machine). Each test was repeated three times and the results have been averaged. Building and lookup phase are influenced by the underlying storage drive. A benchmark of both drives reported a read data rate of 128 MiB per second for the HDD and 266 MiB per second for the SSD. Thus, reading a 2 GiB file into memory takes approximately 16 seconds from the HDD and 8 seconds from the SSD. We further used the SSD throughout the following tests. Both drives have been benchmarked using the Linux tool hdparm.

Testdata. Tests are performed on random data and synthetic images. The considered file set consists of 4096 files. Each file has a size of 524,288 bytes, totaling in an image of 2 GiB. The image was created by concatenating all files together. We depict a global blocksize b of 512 bytes for all candidates and all tests. This should lead to a comparable compression rate and equal treatment in the different phases of processing. Since each file is a multiple of b in length, a fixed-size extraction of blocks will not have to cope with any alignment issues.

Further extensions. Fixed block size hashing was additionally implemented for hbft and fhmap to allow a uniform comparison. However, the originally proposed chunk hashing function FNV-256 remained unchanged. FNV-1 is used for fhmap since the length of FNV-256 is not necessary. In case of the parallelized rolling hash, all cores can be kept busy as long as the process of reading is faster than the actual processing of extracted chunks. If I/O is slower than the processing, only one thread will process the image blocks. Remaining cores will not be utilized and the block building is bottlenecked by a slow I/O bus. hashdb can operate multi-threaded in all reading steps. As introduced in the course of this work, hbft and fhmap feature multi-threading in their block building phase only. To allow a comparison, timings for single-threaded usage of hashdb are given as well, where hashdb implements a check of available system cores. This function was temporarily altered to always return a value of one. However, the implementation uses a producer-consumer approach and will spawn an additional thread anyway. By the usage of taskset we finally forced the execution on a single core.

Memory consumption

The three approaches feature different memory requirements. After ingesting the 2 GiB test set, hashdb produces files totaling 405.9 MB on disk. Since there is no theoretical background to calculate the data structure's size in main memory, it is approximated using the top command. The in-memory size of the structure is about 900 MB.

In the case of fhmap, the author (https://probablydance.com/2017/02/26/i-wrote-the-fastest-hashtable/) mentions some storage overhead for the handling of key-value pairs. The overhead will be at a minimum of 8 bits per entry and will be padded to match the actual key length. Assuming a key length of 64 bit, the overhead for fhmap would be 64 bit as well. Assuming the test set of 2 GiB with a block size b = 512, the total amount of blocks n will be 4,194,304. The total size s in main memory with a load factor of 0.5 would then result in s = (n · 3 · 64 bit) / 0.5 ≈ 200 MiB. The allocation size on disk would be halved to approximately 100 MiB, caused by the load factor.

The size s of a hbft depends on various parameters and is mainly influenced by the data set size, the block size b, and a desired false positive rate fpr. In our scenario we approximate the root filter size for the parameters data set size = 2 GiB, b = 512 and fpr = 10^-6. This leads to an approximated root filter size of m1 ≈ 14 MiB. The tree consists of log2(4096) = 12 levels. Thus, the total amount of needed memory is approximately 12 · 14 MiB = 168 MiB. The size on disk will be approximately the same since the array and corresponding Bloom filters need to be saved. For further details of the hbft parametrization and calculation, we refer to Lillis et al. (2017).
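The estimates above can be reproduced with the standard Bloom filter dimensioning formula m = −n·ln(p)/(ln 2)², which we assume here as the sizing rule behind the 14 MiB root filter figure; the fhmap estimate simply multiplies the number of entries by the per-entry width and the load factor.

```python
# Back-of-the-envelope check of the memory estimates (assumed sizing rules).
import math

n = (2 * 1024**3) // 512                 # 4,194,304 blocks of 512 bytes in 2 GiB

# fhmap: 64-bit key + 64-bit value + 64-bit overhead per entry, load factor 0.5
fhmap_bits = n * 3 * 64 / 0.5
print(fhmap_bits / 8 / 1024**2)          # ~192 MiB, i.e. roughly 200 MiB

# hbft root filter: standard Bloom filter size for n entries at fpr = 10**-6
fpr = 1e-6
m_bits = -n * math.log(fpr) / (math.log(2) ** 2)
root_mib = m_bits / 8 / 1024**2
print(root_mib)                          # ~14.4 MiB (rounded to 14 MiB above)
print(12 * root_mib)                     # ~173 MiB, in line with the ~168 MiB estimate
```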
An overview of the required storage for each technique can be seen in Table 3. In conclusion, hbft is the most memory-efficient approach, followed by fhmap and hashdb. Nevertheless, it should be noted that only hashdb is able to work with databases which do not completely fit into RAM. In contrast, hbft and fhmap will only work if the database fits into RAM.

Table 3. Harddisk and memory consumption of hashdb, fhmap and hbft for processing 2 GiB of input data.

Technique   Disk        RAM
hashdb      405.9 MiB   900.0 MiB
fhmap       100.0 MiB   200.0 MiB
hbft        168.0 MiB   168.0 MiB
Build phase

The creation of a database consists of several steps, including the initialization of the different structures for data storage and handling. The extraction of blocks by splitting an input stream is considered for fixed blocks (similar to hash-based carving) or varying blocks (similar to approximate matching). In detail, we integrated a fixed block extraction (fi) and the extraction per rolling_hash (ro) for hbft and fhmap. Afterwards, in all cases the blocks are hashed. In case of hashdb, MD5 is used, while hbft and fhmap use FNV-256 and FNV-1, respectively.

The overall runtime is presented in Fig. 6a for a single-threaded execution. Results for both block building approaches are displayed and evaluated in the case of hbft and fhmap. The high discrepancy of runtime in case of hashdb is also caused by the setup of its metadata and relational features. The high difference from CPU to real time is related to read and write operations. Fig. 6b shows the results for a multi-threaded execution (8 threads). The timings for the hbft and fhmap block building algorithms keep constant since multithreading is not implemented in the building phase yet. With the utilization of eight threads, hashdb cuts down its processing time tremendously. However, the approach is still slower than both block building algorithms of hbft and fhmap.

Fig. 6. Build time performance of hashdb/hbft/fhmap. In case of hbft and fhmap we consider fixed blocks (fi) and the extraction per rolling_hash (ro).

Deduplication phase

Recalling the utilized set of random data, randomly generated data does not feature any multihits and thus nearly all of the extracted chunks result in unique hits. Our considered version of hashdb allows deduplication of an existing database per configuration. The associated counter values for all hashed blocks are checked and all blocks with a value higher than one are deleted. The remaining chunks are written to a new database which finally ensures unique matches during lookup. The size of the newly established database stays the same. The multihit mechanisms for hbft and fhmap introduced and implemented in Section 'Extensions to hbft and fhmap' are executed in memory only. The runtime results of the deduplication phase are displayed in Fig. 7a and b.

Timings do not differ remarkably for single- or multi-threaded scenarios. The deduplication procedure for hashdb is slightly slower since it needs to read the database from disk first. As the rolling hash produces fewer blocks than a fixed-size extraction, the overall amount of inserted chunks decreases and thus the runtime of deduplication improves. The fast deduplication time of fhmap is caused by three facts: First, there are no special structures which have to be additionally set up or evaluated. Second, the deduplication happens in the building phase as well, so there is no clear separation between building and deduplication for hash tables. Last, the random set does not feature any multihits. hashdb and hbft need to process their temporary databases and filter out multihits before inserting unique chunks into the actual database. In the case of fhmap, previously marked multihit entries are simply deleted from the database.

Fig. 7. Deduplication performance of hashdb/hbft/fhmap with zero occurring multihits in a set.
Lookup phase

The lookup consists of splitting an image into blocks, hashing those blocks, and querying them against the database. As we are only interested in efficiency (but not in detection performance), we make use of a simple approach to simulate full and partial detection scenarios. We create four different images which are queried against the databases. All images have a fixed size of 2 GiB. Each of the four images is constructed to match either 100%, 75%, 50%, or 25% of the database. Again, we point out that the different file matching sizes are only used to investigate the efficiency behaviour dependent on different matching rates, i.e., the matching rate is the input parameter. Images with matching rates below 100% are partially filled with random bytes to reach the desired size of 2 GiB. The size of every inserted file is a multiple of b = 512 bytes. Thus, images are crafted which do not cause alignment issues for a fixed block extraction. It has to be considered that the rolling hash produces fewer blocks, which additionally vary in size.

Results of the benchmark are shown in Fig. 8. Fig. 8a and b show the lookup performance in the case of single-threaded evaluations. Fig. 8c displays the parallelized version with a total amount of eight running threads.

Fig. 8. Lookup performance evaluation (real time).

As shown, fhmap features the fastest lookup, followed by hbft and lastly hashdb. The results underline the performance of fhmap in all cases and the impact of fixed blocks, in contrast to the overhead caused by computing a rolling hash. A significant speedup is gained by our proposed parallelization of the rolling hash. The plot shows stable lookup results for all matching rates, with fhmap outperforming its competitors. Lookup times for hbft and fhmap increase slightly with a rising matching rate. The lookups of hashdb are higher due to its complex internal structure.

Pre-filters for hbft and hashdb speed up the lookup time for non-matches notably. In case of hbft the root Bloom filter will rule out non-matches instantly. Otherwise, a query needs to inspect subsequent nodes, shown by the slightly increased lookup times for higher matching rates. If a chunk does not match hashdb's compressed Hash Store, the actual database is not queried either. A Hash Store claims to have a false positive rate of 1 in 72 million with a database containing 1 billion hashes. However, every hash will be queried against this store first before searching the actual database. The presented flat hash map does not feature any pre-filtering so far. Each key will be queried against the database. Performed tests with different sized databases did not differ remarkably.

Conclusion

In this work we discussed and evaluated three different implementations of artifact lookup strategies in the course of digital forensics. Several extensions have been proposed to finally perform a comprehensive performance evaluation of hashdb, hbft, and fhmap. We introduced concepts to handle multihits for hbft and fhmap by the implementation of deduplication and filtration features. Moreover, we interfaced fhmap with a rolling hash based extraction of chunks. For a better comparison to hashdb, we additionally parallelized the extraction of chunks.

Results show that fhmap outperforms hbft in most of the considered performance evaluations. While hbfts are faster than hashdb in nearly all evaluations, the concept introduces false positives by the utilized Bloom filters. Even if hbfts have small advantages in case of memory and storage efficiency, their complexity, fixed parametrization, and limited scope of features make such an advantage negligible. However, specific use cases with tight memory constraints could make hbfts still valuable.

Discussions of hashdb in terms of performance should consider the underlying concept of single-level stores. Shifting the discussion to offered features and long-term usage with ongoing maintenance, hashdb and fhmap are more suitable. One thing to note is that hashdb is the only implementation that is able to deal with databases which do not fit into main memory. In addition, it supports transactional features.

In Table 4 a final comparison of all three candidates in terms of performance and supported features is given. The final overview underlines the trade-offs between the concepts, where fhmap shows a constant performance in most of the mentioned categories.

Table 4. Final comparison of hashdb/hbft/fhmap in case of performance and offered features.

Future work

A concept similar to single-level stores for digital artifacts with stable results in all of the mentioned categories is desirable. While most of the considered challenges first rely on a high amount of engineering effort, the direct integration of a multihit prevention into a single-level store could be an interesting field of research. Concepts to close the gap between performance-oriented memory-resident lookup strategies and transactional databases are needed.
Acknowledgment

This work was supported by the German Federal Ministry of Education and Research (BMBF) as well as by the Hessen State Ministry of Higher Education, Research and the Arts within CRISP (www.crisp-da.de).

References

Breitinger, F., Baier, H., 2012. Similarity preserving hashing: eligible properties and a new algorithm mrsh-v2. In: International Conference on Digital Forensics and Cyber Crime. Springer, pp. 167–182.
Breitinger, F., Baier, H., White, D., 2014a. On the database lookup problem of approximate matching. Digit. Invest. 11, Supplement 1, S1–S9. Proceedings of the First Annual DFRWS Europe. ISSN: 1742-2876.
Breitinger, F., Rathgeb, C., Baier, H., 2014b. An efficient similarity digests database lookup – a logarithmic divide & conquer approach. The Journal of Digital Forensics, Security and Law: JDFSL 9, 155.
Celis, P., Larson, P.-A., Munro, J.I., 1985. Robin Hood hashing. In: 26th Annual Symposium on Foundations of Computer Science. IEEE, pp. 281–288.
Chu, H., 2011. MDB: a memory-mapped database and backend for OpenLDAP. In: Proceedings of the 3rd International Conference on LDAP, Heidelberg, Germany, p. 35.
Fan, B., Andersen, D.G., Kaminsky, M., Mitzenmacher, M.D., 2014. Cuckoo filter: practically better than Bloom. In: Proceedings of the 10th ACM International on Conference on Emerging Networking Experiments and Technologies. ACM, pp. 75–88.
Foster, K., 2012. Using Distinct Sectors in Media Sampling and Full Media Analysis to Detect Presence of Documents from a Corpus. Technical Report. Naval Postgraduate School, Monterey, CA.
Garfinkel, S.L., McCarrin, M., 2015. Hash-based carving: searching media for complete files and file fragments with sector hashing and hashdb. Digit. Invest. 14, S95–S105.
Gupta, V., Breitinger, F., 2015. How cuckoo filter can improve existing approximate matching techniques. In: International Conference on Digital Forensics and Cyber Crime. Springer, pp. 39–52.
Harichandran, V.S., Breitinger, F., Baggili, I., 2016. Bytewise approximate matching: the good, the bad, and the unknown. Journal of Digital Forensics, Security and Law 11, 4.
Liebler, L., Breitinger, F., 2018. mrsh-mem: approximate matching on raw memory dumps. In: International Conference on IT Security Incident Management and IT Forensics. IEEE, pp. 47–64.
Lillis, D., Breitinger, F., Scanlon, M., 2017. Expediting mrsh-v2 approximate matching with hierarchical Bloom filter trees. In: International Conference on Digital Forensics and Cyber Crime. Springer, pp. 144–157.
Pagani, F., Dell'Amico, M., Balzarotti, D., 2018. Beyond precision and recall: understanding uses (and misuses) of similarity hashes in binary analysis. In: Proceedings of the Eighth ACM Conference on Data and Application Security and Privacy. ACM, pp. 354–365.
Young, J., Foster, K., Garfinkel, S., Fairbanks, K., 2012. Distinct sector hashes for target file detection. Computer 45, 28–35.
