Scalable Blocking For Very Large Databases
Andrew Borthwick, Stephen Ash, Bin Pang, Shehzad Qureshi, and Timothy Jones
1 Introduction
In contrast to prior work such as [3] which seeks to build an optimal set
of fields on which to block records, the philosophy of the dynamic blocking
family of algorithms [19] is to avoid selecting a rigid set of fields and instead
dynamically pick particular values or combinations of values on which to block.
As an example of blocking on a fixed, static set of blocking key fields, consider
a system to deduplicate U.S. person records by simply proposing all pairs of
persons who match on the field last name. This would be prohibitively expensive
due to the necessity of executing a pairwise matching algorithm on each of the
(1,400,000 choose 2) pairs of people who share the surname “Jones”. A pairwise scoring
model averaging 50 µsecs would take ≈ 567 days to compare all “Jones” pairs.
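As a quick back-of-the-envelope check of that estimate (ours, not from the paper):

    from math import comb

    jones = 1_400_000              # records sharing the surname "Jones"
    pairs = comb(jones, 2)         # ~9.8e11 candidate pairs
    seconds = pairs * 50e-6        # 50 microseconds per pairwise comparison
    print(f"{pairs:.2e} pairs, {seconds / 86_400:.0f} days")  # ~9.80e+11 pairs, ~567 days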
On the other hand, suppose that we statically select the pair of fields (first name,
last name) as a single blocking key. This solves the problem of too many “Jones”
records to compare, but is an unfortunate choice for someone with the name
“Laurence Fishburne” or “Shehzad Qureshi”. Both of these surnames are rare in
the U.S. A static blocking strategy which required both given name and surname
to match would risk missing the pair (“laurence fishburne”,“larry fishburne”) or
(“shehzad qureshi”,“shezad qureshi”). Differentiating between common and less
common field/value pairs in the blocking stage fits with the intuition that it is
more likely that two records with the surname “Fishburne” or “Qureshi” rep-
resent the same real-world individual than is the case for two records with the
surname “Jones”, which is an intuition backed up by algorithms that weight
matches on rare values more strongly than matches on common values [6, 17,
28].
This work makes the following contributions: (1) We describe a new algo-
rithm called Hashed Dynamic Blocking (HDB), which is based on the same underlying principle as dynamic blocking [19] but achieves massive scale by minimizing data movement, using a compact block representation, and greedily pruning ineffective
candidate blocks. We provide benchmarks that show the advantages of this ap-
proach to blocking over competing approaches on huge real-world databases.
(2) Our experimental evidence emphasizes very large real-world datasets in the
range of 1M to 530M records. We highlight the computational complexity chal-
lenges that come with working at this scale and we demonstrate that some
widely cited algorithms break down completely at the high end. (3) We describe
a version of Locality Sensitive Hashing applied to blocking that is easily tunable
for increased precision or increased recall. Our application of LSH can generate
(possibly overlapping) blocking keys for multiple columns simultaneously, and
we provide empirical evaluation of LSH versus Token Blocking to highlight the
trade-offs and scaling properties of both approaches.
Fig. 1: Probability of blocking as a function of Jaccard similarity for several LSH(b, w) settings (left), and Pair Quality (PQ, log-scale) versus Pair Completeness (PC) for the same LSH settings, Token Blocking (TB), and OFF (right).
Like most other blocking approaches, such as Meta-blocking [21], dynamic blocking begins with a set of records and a block building step that computes a set of top-level blocks, each of which is a set of records that share a value computed by a block building process, t, where t is a function that returns a set of one or more blocking keys when applied to an attribute a_k of a single record, r. The core HDB algorithm described in Section 3 is agnostic to the approach to block building.
With structured records, one can use domain knowledge or algorithms to pick which block building process to apply to each attribute. We use the term Identity Block Building for the process of simply hashing the normalized (e.g. lower-cased) attribute value concatenated with the attribute id to produce a blocking key. Thus the string “foo” in two different attributes returns two different top-level blocking keys (i.e. hash values). For attributes where we wish to allow fuzzier matches to still block together, we propose LSH Block Building, as described in the next section. Alternatively, Token Blocking [21] is a schema-agnostic block building process where every token of every attribute becomes a top-level blocking key. Note that, unlike Identity Block Building, the token “foo” in two different attributes will return just a single blocking key.
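For illustration, the following minimal sketch (ours, not the authors' implementation; it assumes a dict-like record and the mmh3 MurmurHash3 Python binding) contrasts the two block building processes:

    import mmh3  # assumed MurmurHash3 binding; any 64-bit hash would do

    def identity_blocking_keys(record, attr_ids):
        # One key per attribute: hash of attribute id + normalized value,
        # so "foo" in two different attributes yields two different keys.
        keys = set()
        for attr in attr_ids:
            value = record.get(attr)
            if value:
                keys.add(mmh3.hash64(f"{attr}|{value.strip().lower()}")[0])
        return keys

    def token_blocking_keys(record, attr_ids):
        # Schema-agnostic: every token of every attribute becomes a key,
        # so "foo" in two different attributes maps to the same key.
        keys = set()
        for attr in attr_ids:
            for tok in (record.get(attr) or "").lower().split():
                keys.add(mmh3.hash64(tok)[0])
        return keys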
In this work we propose a new block building approach which incorporates Lo-
cality Sensitive Hashing (LSH) [15, 13] with configurable parameters to control the precision/recall trade-off per column. LSH block building creates multiple
sets of keys that are designed to group similar attribute values into the same
block. We leverage a version of the algorithm which looks for documents with
high degrees of Jaccard similarity [18, 27] of tokens. Here a token could be defined
as a word, a word n-gram, or as a character q-gram.
[18] describes a method in which, for each document d, we first apply a min-
hash algorithm [5] to yield m minhashes and we then group these minhashes into
b bands where each band consists of w = m/b minhashes. In our approach each
of these bands constitutes a blocking key. Now consider a function LSH(b, w, j),
in which b and w are the LSH parameters mentioned above and j is the Jaccard
similarity of a pair of records. Then LSH(b, w, j), the probability that the attributes of two records with Jaccard similarity j will share at least one key, can be computed as LSH(b, w, j) = 1 − (1 − j^w)^b. LSH(b, w, ·) has an attractive property in that the probability of sharing a key is very low for low Jaccard similarity and very high for high Jaccard similarity. Figure 1 graphs LSH(b, w, ·)
for various values of (b, w), which gives us a range of attractive trade-offs on
the Pair-Quality (i.e. precision) versus Pair-Completeness (i.e. recall) curve by
varying the two parameters for LSH, b and w.
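A minimal sketch of this banding scheme (ours; the tokenizer, the seeding scheme, and the mmh3 dependency are illustrative assumptions):

    import re
    import mmh3

    def lsh_blocking_keys(attr_id, text, b=6, w=7):
        # Tokenize into words; word n-grams or character q-grams are alternatives.
        tokens = set(re.findall(r"\w+", text.lower()))
        if not tokens:
            return []
        # m = b * w minhashes, one per seeded hash function.
        minhashes = [min(mmh3.hash(tok, seed) for tok in tokens)
                     for seed in range(b * w)]
        keys = []
        for band in range(b):
            # Each band of w minhashes becomes one blocking key for this attribute.
            band_vals = tuple(minhashes[band * w:(band + 1) * w])
            keys.append(mmh3.hash64(f"{attr_id}|{band}|{band_vals}")[0])
        return keys

    # Two attribute values with token Jaccard similarity j share at least one
    # key with probability 1 - (1 - j**w)**b.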
Fig. 2: Diagram illustrating how candidate blocks are processed in Hashed Dynamic Blocking. (Stages in the diagram: Count-Min Sketch, exact counts, Bloom filter; block categories: any over-sized, right-sized, over-counted, duplicate, too similar, deduped over-sized.)
5: P ← {(b_i, b_j) | b_i ∈ b, b_j ∈ b ∧ b_i < b_j}
6: parallel for (b_i, b_j) ∈ P do
7:   x.key ← Murmur3(b_i.key, b_j.key)
     ▷ the new block's size is unknown at this point, but we carry the smallest parent's size
8:   x.psize ← min(b_i.size, b_j.size)
9:   R[rid] += x
10: end for
11: end for
12: return R
13: end function
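In Python terms, the per-record intersection step sketched above might look as follows (our rendering, not the authors' code; in practice this runs as a parallel map over a distributed dataset, and the MAX_SIMILARITY progress heuristic described below is applied afterwards):

    from itertools import combinations
    import mmh3

    def intersect_oversized_keys(record_to_blocks):
        # record_to_blocks: rid -> list of (key, size) 2-tuples for that record's
        # over-sized blocks. Every pair of keys yields a candidate key for the
        # intersection of the two parent blocks.
        result = {}
        for rid, blocks in record_to_blocks.items():
            out = []
            for (k1, s1), (k2, s2) in combinations(sorted(blocks), 2):
                new_key = mmh3.hash64(f"{k1}|{k2}")[0]
                # The intersected block's size is unknown here; carry the
                # smaller parent size so later stages can reason about progress.
                out.append((new_key, min(s1, s2)))
            result[rid] = out
        return result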
Each record's over-sized blocking keys are replaced by new hashes computed by combining every pair of existing over-sized hashes for that record. There are some blocking key intersections which we do not want to produce. For example, if a dataset had 4 nearly identical over-sized blocks, then after the first intersection these 4 blocks would intersect with each other to produce (4 choose 2) = 6 blocks, but since these were already over-sized and nearly identical, the intersected blocks would be over-sized as well. This quadratic growth of over-sized blocking keys per record would not converge. To avoid this hazard, we apply a progress heuristic and only keep blocking key intersections that reduce the size of the resulting blocks by some fraction, MAX_SIMILARITY.
Algorithm 3 Rough Over-sized Block Detection
Input: K, a dataset mapping each rid to its over-sized blocks b_0..b_n, where b_i is a 2-tuple of the block key hash, b_i.key, and the count of records in the parent block, b_i.psize
Output: K_R, a dataset of record to right-sized blocks
Output: K̃_O, a map of record to possibly over-sized blocks
1: function RoughOversizeDetection(K)
2:   cms ← ApproxCountBlockingKeys(K)
3:   K_R ← ∅
4:   K̃_O ← ∅
5:   parallel for (rid, b_0..b_n) ∈ K do
6:     for all b_i ∈ b do
7:       s ← cms[b_i.key]
8:       p ← b_i.psize
9:       if s ≤ MAX_BLOCK_SIZE then
10:        K_R[rid] += b_i   ▷ right-sized
11:      else if (s/p) ≤ MAX_SIMILARITY then
12:        K̃_O[rid] += b_i   ▷ over-sized
13:      end if
         ▷ We discard over-sized blocks that are too similar in size to their parent
14:    end for
15:  end for
16:  return K_R, K̃_O
17: end function
This heuristic filter is applied in Algorithm 3, using the minimum parent block size which we propagate on line 2.8.
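For concreteness, a toy Count-Min Sketch of the kind ApproxCountBlockingKeys could be built on (our sketch; the width, depth, and hashing choices are illustrative, not the authors'):

    import mmh3

    class CountMinSketch:
        def __init__(self, width=1 << 20, depth=4):
            self.width, self.depth = width, depth
            self.table = [[0] * width for _ in range(depth)]

        def add(self, key, count=1):
            for row in range(self.depth):
                self.table[row][mmh3.hash(str(key), row) % self.width] += count

        def estimate(self, key):
            # Never under-counts; may over-count on hash collisions, which is why
            # blocks flagged as over-sized are later re-checked with exact counts.
            return min(self.table[row][mmh3.hash(str(key), row) % self.width]
                       for row in range(self.depth))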
Precise counting then partitions the possibly over-sized blocks into three groups: (1) erroneously over-counted blocks that are actually right-sized, K̂_R, which we subsequently union into this iteration's right-sized blocks (line 1.10); (2) duplicate over-sized blocks, which we discard; and (3) surviving, deduplicated over-sized blocks, K_O, with precise counts of how many records are in each, which are then further intersected in the next iteration.
Block A duplicates block B if block A’s record IDs are equal to block B’s.
We arbitrarily discard duplicate blocks, leaving only a single surviving block
from the group of duplicates, in order to avoid wasting resources on identical
blocks that would only continue to intersect with each other, but produce no new
pair-wise comparisons. We do this exact count and dedup in parallel in one map-
reduce style operation. To deduplicate the blocks we build a block membership
hash key by hashing each record ID in the candidate block and bit-wise XORing
them together. Since XOR is commutative, the final block membership hash key
is then formed (reduced) by XORing the partial membership hash keys.
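A small sketch of this commutative membership hash (ours; mmh3 stands in for whatever 64-bit hash the implementation uses):

    from functools import reduce
    import mmh3

    def block_membership_hash(record_ids):
        # Hash each record ID and XOR the hashes together. Because XOR is
        # commutative and associative, partial hashes from different partitions
        # can be combined in any order by a distributed reduce.
        return reduce(lambda a, b: a ^ b,
                      (mmh3.hash64(str(rid))[0] for rid in record_ids), 0)

    # Candidate blocks that share the same membership hash are treated as
    # duplicates, and only one survivor from each group is kept.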
On line 4.6, we discard duplicate copies of blocking keys that have the same
block membership hash key. From these deduplicated blocking keys, H_U, we create a string multiset, counts, to precisely count the over-sized blocking keys. Even in our largest dataset of over 1 billion records, the largest count of over-sized blocks in a particular iteration after deduplication is ≈2.6M, which easily fits into
memory, but if this memory pressure became a scaling concern in the future, we
could use another Count-Min Sketch here.
Lastly, we need to distinguish the erroneously over-counted blocks which are actually right-sized, K̂_R, from the surviving, deduplicated blocks, H_U. On line 4.8 we build a Bloom filter [4] over all of the over-sized blocking keys, H_O, which contains both duplicate and surviving over-sized blocks as determined by precise counting. Therefore, the Bloom filter answers the set membership question: is this blocking key possibly over-sized? In this way, we use this filter as a mechanism to detect right-sized blocks that were erroneously over-counted. We build the Bloom filter using a large enough bit array to ensure a low expected false positive rate of 1e−8. Even in our largest dataset, the biggest Bloom filter that we have needed is less than 100MB.
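A toy Bloom filter illustrating this membership test (ours; a production implementation would use an existing, mergeable library rather than this sketch):

    import math
    import mmh3

    class BloomFilter:
        def __init__(self, n_items, fp_rate=1e-8):
            # Standard sizing for a target false-positive rate.
            n_items = max(1, n_items)
            self.m = max(8, int(-n_items * math.log(fp_rate) / math.log(2) ** 2))
            self.k = max(1, round(self.m / n_items * math.log(2)))
            self.bits = bytearray(self.m // 8 + 1)

        def _positions(self, key):
            for seed in range(self.k):
                yield mmh3.hash(str(key), seed) % self.m

        def add(self, key):
            for p in self._positions(key):
                self.bits[p // 8] |= 1 << (p % 8)

        def might_contain(self, key):
            # False: definitely not over-sized (the block is right-sized).
            # True: possibly over-sized (with a small false-positive rate).
            return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

A blocking key whose approximate count looked over-sized but which is absent from the filter is therefore known to be right-sized.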
4 Prior Work
4.1 Prior work on Dynamic Blocking
The need for Hashed Dynamic Blocking may be unclear since its semantics (the pairs produced after pair deduplication) are essentially the same as those of [19] for scalar-valued attributes. Relative to [19], this work offers the following ad-
vantages: (1) [19] had a substantial memory and I/O footprint since the content
of the records being blocked had to be carried through each iteration of the
algorithm. (2) LSH would have been challenging to implement in the Dynamic
Blocking algorithm of [19] as it did not contemplate blocking on array-valued
columns.
Table 1: Datasets used for experiments where BB indicates (L)SH or (T)oken
block building strategy and positive labels marked with † are complete ground
truth. Datasets marked C are Commercial datasets.
Moniker   Records   +Labels   Cols  BB  Src
VAR1M     1.03M     818       60    L   C
VAR10M    10.36M    8,890     60    L   C
VAR25M    25.09M    20,797    60    L   C
VAR50M    50.02M    40,448    60    L   C
VAR107M   107.58M   80,068    60    L   C
VAR530M   530.73M   76,316    60    L   C
VOTER     4.50M     53,653†   108   L   [1]
SCHOLAR   64,263    7,852†    5     L   [16]
CITESR    4.33M     558k†     7     L   [26]
DBPEDIA   3.33M     891k†     —     T   [11]
FREEB     7.11M     1.31M†    —     T   [11]
5 Experimental Results
5.1 Datasets
5.2 Metrics
been unable to get it to complete. We ran into similar issues when running
BLAST [26] on our huge datasets, which we expected given that they broadcast
hash maps of record ID → blocking keys to every node. For our large datasets,
this single broadcast map would be multiple TBs of memory.
We note that HDB demonstrates improved recall over PMB despite PMB
producing more pairs to evaluate. We believe this may be a consequence of
the heuristic of meta-blocking weighting pairs that occur in multiple blocks. In
the case of LSH-based blocking keys where there are many highly overlapping
blocks, this may result in PMB picking many redundant pairs that don’t improve
compression. HDB, by contrast, prefers to focus on the blocks that are small enough to evaluate thoroughly and to find intersections of over-sized blocks. This may produce more diversity in the pairs emitted by HDB compared to PMB.
6 Conclusions
We have shown Hashed Dynamic Blocking applied to a variety of large datasets of up to 530M records. We also introduced the LSH-based block building technique,
and illustrated its usefulness in blocking huge datasets. The Hashed Dynamic
Blocking algorithm leverages a fortunate convergence in the requirements for
efficiency and accuracy. HDB accomplishes this through a new algorithm which
iteratively intersects and counts sets of record IDs using an inverted index and
approximate counting and membership data structures. This efficient implemen-
tation is fast, robust, cross-domain, and schema-independent, thus making it an
attractive option for blocking large complex databases.
References