Aurum: A Data Discovery System

Raul Castro Fernandez, Ziawasch Abedjan†, Famien Koko, Gina Yuan, Sam Madden, Michael Stonebraker
MIT {raulcf, fakoko, gyuan, madden, stonebraker}@csail.mit.edu
†TU Berlin [email protected]
Abstract—Organizations face a data discovery problem when their analysts spend more time looking for relevant data than analyzing it. This problem has become commonplace in modern organizations as: i) data is stored across multiple storage systems, from databases to data lakes to the cloud; and ii) data scientists do not operate within the limits of well-defined schemas or a small number of data sources—instead, to answer complex questions they must access data spread across thousands of data sources. To address this problem, we capture relationships between datasets in an enterprise knowledge graph (EKG), which helps users navigate among disparate sources. The contribution of this paper is AURUM, a system to build, maintain and query the EKG. To build the EKG, we introduce a two-step process which scales to large datasets and requires only one pass over the data, avoiding overloading the source systems. To maintain the EKG without re-reading all data every time, we introduce a resource-efficient signature sampling (RESS) method which works by using only a small sample of the data. Finally, to query the EKG, we introduce a collection of composable primitives, thus allowing users to define many different types of discovery queries. We describe our experience using AURUM in three corporate scenarios and present a performance evaluation of each component.

I. INTRODUCTION

With a myriad of data sources spread across multiple heterogeneous databases, modern organizations face a data discovery problem: analysts spend more time finding relevant data to answer the questions at hand than analyzing it.

For example, consider an analyst at a large drug company who is assigned the task of predicting the change in the company's stock price after a presidential election. To build her model, she decides she needs: i) the company's stock variations in recent elections; ii) the mentions of the company in social media channels; iii) the number of drugs about to be approved by the FDA; and iv) the current productivity of the research department. She needs to find tables with all of these attributes, as well as additional attributes that can be used to link the different drugs, years, and mentions together. The discovery challenge is to find data sources containing this information among the many thousands of tables in the RDBMSs, data lakes and warehouses in the organization. Doing this manually is extremely labor intensive, as the analyst has to browse a large number of files and tables to find those that might match, and then try to find key attributes that can be used to join these tables together. Our work with data analysts at a number of organizations, including Merck, British Telecom, the City of New York, and our university, suggests that they all struggle with such problems on a daily basis.

Although some companies have built in-house solutions to help their analysts find files in their data lakes and databases, these systems are engineered around a customized index and search algorithms that solve a predetermined set of use cases; they do not solve the more general discovery problem presented here. For example, systems such as Goods [1], Infogather [2] and Octopus [3] do not provide a flexible way to support discovery queries other than those they were designed to solve. Analysts want to find datasets according to different criteria, e.g., datasets with a specific schema, datasets similar to a reference one, joinable datasets, etc. To avoid the productivity bottleneck of discovering new data sources, we need a more general structure that organizes the data sources, represents their relationships, and allows flexible querying; otherwise data discovery becomes a bottleneck that hampers productivity within organizations.

This paper is concerned with the design and implementation of a discovery system that permits people to flexibly find relevant data through properties of the datasets or syntactic relationships among them, such as similarity of content or schemas, or the existence of primary key/foreign key (PK/FK) links. The key requirements of such a system are: 1) it must work with large volumes of heterogeneous data; 2) it must incrementally adapt to continuously evolving data; and 3) it must provide querying flexibility to support the varied discovery needs of users.

We represent relationships between the datasets in a data structure we call the enterprise knowledge graph (EKG), which analysts use to solve their discovery problems. The contribution of this paper is AURUM, a system to build, maintain and query the EKG:

• Build: Building the EKG requires performing an all-pairs comparison (O(n²) time) between all datasets for each of the relationships we wish to find. With a large number of sources this quadratic cost is infeasible, both because of its time and because of the need to read external sources multiple times, thus incurring IO overhead in the external systems. We introduce a two-step process that consists of a profiler, which summarizes all data from the sources into space-efficient signatures by reading the data only once, and a graph builder, which finds syntactic relationships such as similarities and PK/FK candidates in O(n) using sketching techniques.

• Maintain: Because data is always changing (req. 2), we must maintain the EKG in a scalable manner: we introduce a resource-efficient signature sampling (RESS) method that avoids repeatedly re-reading source data, while still keeping the EKG up-to-date.

• Query: Finally, because discovery needs are varied, we cannot pre-define queries (req. 3). Instead, we introduce a source retrieval query language (SRQL) based on a set of discovery primitives that compose arbitrarily, allowing users to express complex discovery queries (a brief sketch of such composition follows). We implement SRQL on an RDBMS-based execution engine, augmenting it with a graph [...] as well as performance benchmarks.
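To make the Query contribution concrete, here is a minimal sketch of what composable discovery primitives could look like. The names (DRS, keyword_search, neighbors) and the toy EKG are illustrative assumptions for this sketch, not Aurum's actual SRQL API; the point is only that every primitive consumes and returns the same result-set type, so primitives chain freely.

# Illustrative sketch of composable discovery primitives (hypothetical API,
# not the actual SRQL implementation).

class DRS:
    """A discovery result set: a set of (table, column) identifiers.
    Every primitive consumes and produces a DRS, so primitives compose."""
    def __init__(self, columns):
        self.columns = set(columns)

    def union(self, other):
        return DRS(self.columns | other.columns)

    def intersection(self, other):
        return DRS(self.columns & other.columns)

# A toy EKG: nodes are (table, column) pairs; edges carry a relationship label.
EKG = {
    ("stock_prices", "ticker"): {("fda_approvals", "company_symbol"): "pk_fk"},
    ("fda_approvals", "drug_name"): {("research_output", "drug"): "content_sim"},
}

def all_columns():
    cols = set(EKG)
    for targets in EKG.values():
        cols.update(targets)
    return cols

def keyword_search(term):
    """Lookup primitive: columns whose name contains the term."""
    return DRS(c for c in all_columns() if term in c[1])

def neighbors(drs, relation):
    """Edge primitive: columns connected to the input set by `relation`."""
    out = set()
    for col in drs.columns:
        out.update(dst for dst, rel in EKG.get(col, {}).items() if rel == relation)
    return DRS(out)

# Composition: columns mentioning "drug", plus columns joinable (PK/FK)
# with anything found by a keyword search on "ticker".
result = keyword_search("drug").union(neighbors(keyword_search("ticker"), "pk_fk"))
print(sorted(result.columns))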
TABLE II
EDGE AND PATH PRIMITIVE PROCESSING TIMES OF DIFFERENT SYSTEMS ON DIFFERENT GRAPHS. TIMES IN MILLISECONDS.
primitives useful to express their discovery queries. Not all users used all of them, though. For example, in the case of sustainability, no path primitives were necessary. In that case, however, ranking was crucial, as the number of sources was larger than in the other use cases. In general, we had a very positive experience working with the users and AURUM.

How do they solve discovery today? We asked them to describe how they solve data discovery problems today, choosing from 6 options (they could select more than one). The options were: asking other people or more senior employees where the data is (3 votes); writing SQL queries and using database metadata to understand the content (3 votes); manual work, i.e., writing scripts and visual inspection (3 votes); using some commercial tool to navigate the sources (1 vote); using some open source tool to navigate the sources (2 votes); and other (0 votes).

3) End-to-end deployment: In the next section, we evaluate the performance of each of Aurum's stages in isolation. In all the deployments we have described here, most time is spent during the profiling stage, i.e., the first step of our two-step process. In addition to being the most computationally expensive step, reading from the source systems is often limited to avoid interfering with other workloads.

Conclusion. Finally, when asked how likely they are to use AURUM in their organization, three users gave a 4 and one gave a 5. We are currently working with them to achieve a deeper integration into their environments, with the hope of discovering more details and related problems. The experiments with AURUM so far helped us confirm that AURUM is useful for varied discovery needs.

B. Query: SRQL Execution Engine Performance

Next we analyze how the G-Index speeds up discovery path primitives. Our goal is to show that G-Index on top of a relational database is sufficient to cope with the discovery needs of querying the EKG, and that specialized graph systems that run general graph workloads are not needed. We compare with other specialized graph systems to support this claim. Also note that specialized graph systems do not support some of our requirements, e.g., provenance. We evaluate the performance of G-Index with 3 queries: a 2-hop path query, a 5-hop path query, and an edge primitive. We compare the runtime with 4 other state-of-the-art systems.

Datasets. We use 7 different EKGs; three of them (DWH, ChEMBL and MassData) are built on real datasets, and the other 4 are generated following the Erdős-Rényi model for random graphs, which we use to expose the limits of the systems. DWH is a public subset of the data warehouse of our university with around 200 tables. ChEMBL is a well-known public chemical database [9] with around 70 tables. MassData is a repository of open governmental data for the state of Massachusetts, with around 300 tables.

Setup. We compare 5 systems in total, including ours. Before running a system, we spent time tuning it until we achieved the best results. Neo4J [13] is a state-of-the-art single-node graph processing system. We configured the database with indexes to speed up the different queries. All the queries are written in Cypher, Neo4J's query language. JanusGraph [14] is an open source version of Titan, a distributed graph database system. We configured it to run on top of Cassandra, because it was the storage backend that yielded the best results. For JanusGraph, we used Gremlin as the query language. Virtuoso is a well-known database for processing RDF data. In the case of Virtuoso, we serialized the EKG as a repository of triples in RDF format, as this was the format in which we found Virtuoso to perform best. We query the system using SPARQL. Last, we run the queries on standalone PostgreSQL configured with indexes. All systems run on a MacBook Pro with 8GB of memory.

Note that SPARQL in Virtuoso uses an abstraction called a property graph—not to be confused with the property graph model [15]—to express path queries. This abstraction does not support path primitives defined on more than one edge type, which means the system is not suitable for SRQL. We included
the results for completeness.

Fig. 2. Profiler runtimes for 3 configurations. Fig. 3. Profiler runtime with two grain sizes. Fig. 4. Graph builder runtimes with increasing numbers of DEs.

Results. We report the average, 95th, and 99th percentile runtimes over 10 runs for each query in Table II. A cross (✗) indicates that a system ran out of memory or exceeded 2 minutes.

Edge primitives. SRQL with G-Index is 2 orders of magnitude faster than Neo4J, and 1 order of magnitude faster than Virtuoso—edge queries are efficient in the underlying PostgreSQL. Although the absolute runtime is low (10ms), the runtime multiplies with the total number of input DEs, so it is important to execute these queries efficiently.

The EKG structure has limited impact on the performance of the edge primitives, but significantly impacts the runtime of path primitives: we ran two discovery path queries, with a maximum of 2 and 5 hops, and present the results next.

2-hop path primitives. On the real data EKGs, G-Index is 20x faster than Virtuoso, which is the second fastest system. On the synthetic EKGs, G-Index performs better, but on par with Virtuoso and Neo4J. G-Index speeds up short path queries over PostgreSQL (by several orders of magnitude on the real EKGs), while PostgreSQL is competitive with Neo4J and even with Virtuoso in the case of the real EKGs.

5-hop path primitives. The biggest impact of G-Index is shown with the more complex 5-hop path queries. None of the other systems completed all the queries over all the EKGs, while G-Index finished them successfully. When the other systems did finish the queries, G-Index performed significantly better. For example, for the real data EKGs, G-Index is 2 orders of magnitude faster. For the synthetic ones, the time is similar to Virtuoso and Neo4J for the least connected EKG, but these systems did not finish the query for any of the other synthetic EKGs.

Loading time and memory consumption. We measured the loading times for the different systems. These times include creating the necessary indexes, for example, creating the G-Index in the case of our solution. PostgreSQL achieves the fastest loading times, and G-Index adds negligible overhead on top when created incrementally—note that we do not need transactional writes to G-Index because the EKG is written first and read later, so the process is very efficient. Including the incremental creation of G-Index, PostgreSQL takes 42s to load the smallest EKG and 160s to load the largest one. The second fastest system, Neo4J, took 134s and 669s, respectively. Virtuoso, which creates many indexes to speed up query execution, took 162s and 2234s, which is 14x more time than PostgreSQL. G-Index is an in-memory structure, so after a shutdown the index must be recomputed. The time to recompute G-Index from scratch—as opposed to incrementally—was 20s for the largest EKG. Finally, G-Index's memory footprint is on the order of 200MB for the smallest EKG and below 2GB for the largest. This makes sense: the index, written in C++, must only keep an efficient bitmap to represent the connections between nodes, which are themselves represented with integers.
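The excerpt only describes G-Index as integer node ids plus one bitmap of connections per node; the following simplified model (Python integers used as bitsets, not the C++ implementation) illustrates why that layout makes short path primitives cheap: a k-hop traversal is just k rounds of OR-ing neighbor bitmaps.

# Simplified model of a bitmap adjacency index, in the spirit of the
# description above (one bitmap of neighbor ids per integer node id).

class BitmapIndex:
    def __init__(self, num_nodes):
        # One Python int per node, used as a bitset over neighbor ids.
        self.adj = [0] * num_nodes

    def add_edge(self, u, v):
        self.adj[u] |= 1 << v
        self.adj[v] |= 1 << u

    def neighbors_bitmap(self, frontier):
        """OR together the bitmaps of every node set in `frontier`."""
        out = 0
        while frontier:
            u = (frontier & -frontier).bit_length() - 1  # lowest set bit
            out |= self.adj[u]
            frontier &= frontier - 1                     # clear that bit
        return out

    def within_k_hops(self, src, k):
        """Nodes reachable from `src` in at most k hops (a path primitive)."""
        seen = frontier = 1 << src
        for _ in range(k):
            frontier = self.neighbors_bitmap(frontier) & ~seen
            if not frontier:
                break
            seen |= frontier
        return [v for v in range(len(self.adj)) if (seen >> v) & 1 and v != src]

idx = BitmapIndex(6)
for u, v in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]:
    idx.add_edge(u, v)
print(idx.within_k_hops(0, 2))   # -> [1, 2]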
Conclusion. With G-Index, discovery primitives, including path queries, execute within human-scale latencies.

C. Build: Two-step Process

We measure the performance of the profiler and graph builder when building the EKG (V-C1), as well as the quality of the links produced (V-C2).

1) Profiler and Graph Builder efficiency: To evaluate the profiler, we measure the time it takes to index varying amounts of data. We ran 3 configurations of the profiler. With the Index configuration, we ran the profiler and also indexed the data into Elasticsearch—necessary to answer lookup primitives. With NoIndex, we only measure profiling time; we ran this configuration with a MinHash signature of 512 permutations (NoIndex-512) and of 128 permutations (NoIndex-128). We used the DWH dataset, which we replicated to achieve the target input data size while keeping the data skew. We ran these experiments on a machine with 32 cores and an SSD disk.

The results in Fig. 2 show that all modes scale linearly with the input data size. The Index mode takes the longest time because the indexing speed of Elasticsearch becomes the processing bottleneck. To factor out this bottleneck and evaluate the performance of the profiler we built, we run the two NoIndex configurations. In the case of NoIndex-512, the limiting factor is computing the signature; when we run NoIndex-128, which is 4x cheaper to compute than NoIndex-512 due to the reduced number of hashing operations, we become limited by the data deserialization routines we use. Nevertheless, these results are very positive. For example, the profiler took only slightly above 1.5 hours to build signatures for the 250GB of data.gov that we used in our deployment with the sustainability team.
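To give a sense of the per-column work behind these numbers, here is a minimal MinHash sketch in which the K permutations are simulated with K seeded hashes and the signature keeps the per-permutation minimum over a column's values. The exact hashing scheme Aurum uses is an assumption; the sketch only shows why halving K (512 to 128) roughly halves the hashing cost.

import hashlib

def minhash_signature(values, k=128):
    """MinHash signature of a column: for each of k simulated permutations,
    keep the minimum hash observed over the column's values. One pass over
    the data suffices; halving k roughly halves the hashing work
    (cf. NoIndex-128 vs. NoIndex-512)."""
    sig = [float("inf")] * k
    for v in values:
        for i in range(k):
            h = int(hashlib.md5(f"{i}:{v}".encode()).hexdigest()[:16], 16)
            if h < sig[i]:
                sig[i] = h
    return sig

def estimated_jaccard(sig_a, sig_b):
    """The fraction of agreeing positions estimates the Jaccard similarity
    of the underlying value sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

col_a = ["aspirin", "ibuprofen", "paracetamol", "naproxen"]
col_b = ["aspirin", "ibuprofen", "paracetamol", "codeine"]
print(estimated_jaccard(minhash_signature(col_a), minhash_signature(col_b)))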
What about task granularity? We want to evaluate the benefits of using finer-grain tasks (Fine-Grain) in comparison with creating one task per table (Coarse-Grain). Fig. 3 shows how, for the same dataset used above, Fine-Grain achieves better performance than Coarse-Grain; real datasets are skewed, and this justifies the design of our profiler.

Graph Builder. To evaluate the performance of the graph builder, we measure the runtime as a function of the number
of input profiles with the content-similarity relationship (DEs on the x axis).

The results in Fig. 4 show linear growth of the graph builder when using both MinHash and TF-IDF—as expected. To obtain the TF-IDF signature, the graph builder must read data from disk so as to calculate the IDF across the profiles, which explains its higher runtime. MinHash signatures are created as part of the profiling process, hence the lower runtime.
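The linear scaling comes from bucketing signatures with LSH instead of comparing all pairs. The banding scheme below is a standard way to do this and is sketched here under the assumption that Aurum's LSH step behaves similarly: columns whose signatures agree on any band land in the same bucket and become candidate content-similarity edges, which can then be verified against the similarity threshold.

from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(signatures, bands=32):
    """Group MinHash signatures into buckets per band; only columns sharing
    a bucket are compared, so the pass over n signatures stays roughly O(n)
    instead of the O(n^2) all-pairs comparison."""
    rows = len(next(iter(signatures.values()))) // bands
    candidates = set()
    for b in range(bands):
        buckets = defaultdict(list)
        for col, sig in signatures.items():
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[band].append(col)
        for cols in buckets.values():
            candidates.update(combinations(sorted(cols), 2))
    return candidates

# Toy signatures (in Aurum these would come from the profiler's MinHash step).
signatures = {
    "drugs.name":     [1, 7, 3, 9, 4, 2, 8, 5],
    "approvals.drug": [1, 7, 3, 9, 4, 2, 8, 6],
    "stocks.ticker":  [5, 0, 2, 1, 9, 9, 3, 4],
}
print(lsh_candidate_pairs(signatures, bands=4))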
2) Accuracy Microbenchmarks: Approximate PK/FK relationship. We measure the precision and recall of the candidate PK/FKs generated by AURUM on 4 different datasets for which we had ground truth. The results are in Fig. 7.

For TPC-H, our approach achieved similar quality to that reported by PowerPivot [7] and better than that reported by Randomness [16]. Similarly, our approach achieved good results for the FDB dataset, which consists of financial information from our university. Chembl [9] contains a larger number of declared PK/FKs in its schema. In this case, we used the technique presented by PowerPivot [7] of filtering PK/FKs by those fields with similar attribute names, which was straightforward to express in a simple SRQL query with 2 primitives. After further inspection of the results, we found that some of the candidate PK/FKs found by AURUM are in fact semantically reasonable joins, even though they are not included in the official PK/FK constraints defined by the schema (confirmed by an expert). When they were false positives, users found them easy to ignore. Finally, we evaluated our method on a dataset used by one of our collaborators and accessed by hundreds of analysts daily (Pharma). In this case both the precision and recall of the relationships we found were above 90%, despite the large number of relationships available.
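The excerpt does not spell out Aurum's PK/FK heuristic, so the sketch below is an assumed but common approximation: a pair is a candidate when the PK side is nearly unique, the FK values are (almost) contained in the PK values, and the attribute names are similar (the PowerPivot-style filter used above). Here the checks run on raw values; Aurum would estimate them from the profiler's signatures rather than re-read the data.

from difflib import SequenceMatcher
from itertools import permutations

def pk_fk_candidates(columns, containment_t=0.95, uniqueness_t=0.95, name_t=0.5):
    """Return (fk, pk) pairs where the pk column is nearly unique, the fk
    values are (almost) contained in the pk values, and the attribute names
    are similar. The thresholds and the name filter are assumptions for
    this sketch, not Aurum's documented parameters."""
    candidates = []
    for (fk_name, fk_vals), (pk_name, pk_vals) in permutations(columns.items(), 2):
        uniqueness = len(set(pk_vals)) / len(pk_vals)
        containment = len(set(fk_vals) & set(pk_vals)) / len(set(fk_vals))
        name_sim = SequenceMatcher(None, fk_name, pk_name).ratio()
        if uniqueness >= uniqueness_t and containment >= containment_t and name_sim >= name_t:
            candidates.append((fk_name, pk_name))
    return candidates

columns = {
    "approvals.drug_id": [1, 2, 2, 3, 3, 5],
    "drugs.drug_id":     [1, 2, 3, 4, 5, 6],
    "stocks.price":      [10.2, 11.5, 9.8, 10.0, 10.7, 11.1],
}
print(pk_fk_candidates(columns))   # [('approvals.drug_id', 'drugs.drug_id')]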
Content similarity relationship. We compare the quality of the content similarity relationships generated with MinHash and LSH against ground truth obtained by running an exact all-pairs similarity method. We use 3 datasets (we did not use Pharma because we could not run the expensive all-pairs method in the production environment). The results are in Table III. We do not show the results for TPC-H and FDB, for which our approach achieved 100% precision and recall. Instead, we add the MassData dataset and show results for MinHash with 512 and 128 permutations. For all three datasets we achieve a high recall, with 100% in the case of ChEMBL and 88% in the case of MassData. The precision for MassData is also high, while for DWH and ChEMBL we achieved a precision of around 50%—note that we are measuring exactly those relationships above the configured 0.7 similarity threshold—and we verified that many of the false positives actually have a high similarity score.

TABLE III
ACCURACY RESULTS FOR CONTENT SIMILARITY

Dataset     #Sim. pairs actual (δ = 0.7)    Precision/Recall (K=512)    Precision/Recall (K=128)
DWH         8872                            56%/92%                     66%/71%
ChEMBL      11                              57%/100%                    42%/100%
MassData    28297                           90%/88%                     89%/91%

Even when we use the cheaper-to-compute MinHash signature with 128 permutations, we still achieve high quality results. This is important because, as we have seen in the previous section, computing the MinHash signature takes a considerable amount of time during the profiling stage.

In conclusion, our two-step process to build the EKG scales linearly with the input data by means of the profiler and graph builder, which compute signatures and relationships in linear time. Despite using approximate methods, the quality of the relationships in the EKG suggests a reasonable approximation trade-off against the expensive exact method.

D. Maintain: RESS

In our final experiment, we evaluate the efficiency of our resource-efficient signature sampling (RESS) method. Our goal is to understand whether we can efficiently detect data changes, so that we can keep the EKG up-to-date at low cost.
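The excerpt describes RESS only at a high level (read a small sample of the source and decide whether a column needs a full re-profile), so the following is one possible realization under assumptions that are not in the text: that the last full pass stored one hash per distinct value, and that the estimated magnitude of change is the fraction of freshly sampled values not seen before.

import hashlib
import random

def value_hash(v):
    return hashlib.md5(str(v).encode()).hexdigest()

def estimate_change(stored_hashes, current_values, sample_frac=0.1, threshold=0.15):
    """Read only a sample of the current column and estimate how much it has
    drifted since the last full pass, as the fraction of sampled values not
    seen before. Columns above the threshold are scheduled for a full
    re-profile; the rest are left untouched."""
    k = max(1, int(len(current_values) * sample_frac))
    sample = random.sample(current_values, k)
    new_fraction = sum(value_hash(v) not in stored_hashes for v in sample) / k
    return new_fraction, new_fraction > threshold

# Last full pass over the source stored one hash per distinct value.
old_column = [f"drug-{i}" for i in range(1000)]
stored = {value_hash(v) for v in old_column}

# The source evolved: 30% of the values were replaced.
new_column = [f"drug-{i}" for i in range(300, 1300)]
change, refresh = estimate_change(stored, new_column)
print(round(change, 2), refresh)   # roughly 0.3, True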
Synthetic experiment: In the first experiment (Fig. 5), we created 100 columns with 1K values drawn from a uniform distribution. We then changed 50 of the columns, varying the size of the change (shown on the X axis). We then run RESS and report the precision and recall of the method when compared with the real changes. Both metrics are higher when the data has changed more. Even when the data does not change much, the recall is very high, and the precision means that only a small amount of additional data must be re-read for verification. To further evaluate RESS, we run a second experiment with real data.

Real data: We obtained two versions of the ChEMBL database: version 21 (which we have used in the rest of this evaluation) and version 22. ChEMBL updates their database version every few months, adding and removing data from the existing relations, so we built an EKG using ChEMBL21 and then changed the underlying data to ChEMBL22, therefore simulating data evolution. At this point, we run RESS to identify the magnitude of change for the columns in the dataset. We then run RESS with different sample sizes and compute the absolute error between the estimated magnitude of change and the true change—which we compute and use as ground truth (420 of the underlying columns changed). The results in Fig. 6 show the error on the x axis and the percentage of columns on the y axis.

RESS identifies 90% of the modified datasets by reading only 10% of the original data, with a maximum error of less than 15%. When using 25% of the data the error reduces to 10%. The baseline error is 5%. For the rest of the sample sizes shown in the figure, RESS behaves as expected: the bigger the sample size, the lower the error.

VI. RELATED WORK

Enterprise search systems. LinkedIn has open sourced WhereHows [17]. The Apache foundation offers Atlas [18] and Google has Goods [1]. All these systems have a strong emphasis on recording the lifecycle of datasets through lineage information, which is orthogonal to the problem we solve in this paper.
Fig. 5. Precision and recall as a function of change (perturbation ratio) with a 10% sample.
Fig. 6. Absolute error for different sample sizes (10%-100%), shown as the percentage of columns within each error.
Fig. 7. Effectiveness of the PK/FK discovery:

Dataset    #FKs actual    Precision/Recall
TPC-H      11             100%/81%
FDB        7              100%/100%
Chembl     85             35%/67%
Pharma     431            100%/91%
These systems provide some subset of search or discovery features, such as finding similar datasets or inspecting existing ones, but none of them permit users to change the discovery query on demand.

Exploratory systems. Octopus [3] and Infogather [2] solve the problem of schema complement. Finding related tables [19] focuses on finding tables that are candidates for joins, similar to schema complement, or candidates for union. Finally, in [20] the authors build a system with a particular set of operations that solve use cases in the context of oceanographic data. All the above systems rely on custom indexes, specific to each use case. In contrast, we take a more general approach to the data discovery problem, building an EKG that we use to flexibly answer varied discovery needs, and that can then be used as the basis for new, more complex relationships.

IR and databases. There is a plethora of systems to perform keyword search in RDBMSs [21], [22], [23], [24], [25], [26]. Most of these systems can be seen as a constrained implementation of our discovery primitives, on an EKG with DE granularity of values. Moreover, most of these systems propose to connect the results through a Steiner tree, so that they can rank them appropriately. Since we maintain a provenance graph of the execution of SRQL queries, we could find a Steiner tree in such a graph, therefore offering similar functionality. We have built AURUM for more general discovery cases that cannot be solved with keyword search alone.

Other links for the EKG. We designed Aurum to support relationships useful to discover and connect datasets within organizations. We discussed some syntactic relationships but left many others out. For example, work on finding functional dependencies [27], as well as correlations among columns [28], [29], is complementary to our work.

VII. CONCLUSION

We presented AURUM¹, a data discovery system that builds, maintains and allows users to query an enterprise knowledge graph to solve diverse discovery needs. We have used AURUM with several companies. Many others are in the process of onboarding the technology, and we have already identified some lines of future work, such as the inclusion of other kinds of non-data-driven relations. We see AURUM as a stepping stone towards addressing the substantial challenges that the modern flood of data presents in large organizations.

¹ https://github.com/mitdbg/aurum-datadiscovery

Acknowledgements: We thank the many users and industrial collaborators who participated in the user studies we conducted for the evaluation.

REFERENCES

[1] A. Halevy et al., "Goods: Organizing Google's Datasets," in SIGMOD, 2016.
[2] M. Yakout et al., "InfoGather: Entity Augmentation and Attribute Discovery by Holistic Matching with Web Tables," in SIGMOD, 2012.
[3] M. J. Cafarella, A. Halevy et al., "Data Integration for the Relational Web," VLDB, 2009.
[4] A. Andoni and P. Indyk, "Near-optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions," Commun. ACM, 2008.
[5] M. Bawa, T. Condie, and P. Ganesan, "LSH Forest: Self-tuning Indexes for Similarity Search," in WWW, 2005.
[6] Q. Lv, W. Josephson et al., "Multi-probe LSH: Efficient Indexing for High-dimensional Similarity Search," in VLDB, 2007.
[7] Z. Chen, V. Narasayya et al., "Fast Foreign-key Detection in Microsoft SQL Server PowerPivot for Excel," VLDB, 2014.
[8] J. Cheney, L. Chiticariu et al., "Provenance in Databases: Why, How, and Where," Found. Trends Databases, 2009.
[9] E. L. Willighagen, A. Waagmeester et al., "The ChEMBL database as linked open data," Journal of Cheminformatics, 2013.
[10] D. S. Wishart, C. Knox et al., "DrugBank: a knowledgebase for drugs, drug actions and drug targets," Nucleic Acids Research, 2008.
[11] T. Liu et al., "BindingDB: a web-accessible database of experimentally determined protein-ligand binding affinities," NAR, 2007.
[12] J. Hastings et al., "The ChEBI reference database and ontology for biologically relevant chemistry: enhancements for 2013," NAR, 2012.
[13] Neo4J, "Neo4J," https://neo4j.com, 2017.
[14] JanusGraph, "JanusGraph: Distributed graph database," https://janusgraph.org, 2017.
[15] W. Sun, A. Fokoue et al., "SQLGraph: An Efficient Relational-Based Property Graph Store," in SIGMOD, 2015.
[16] M. Zhang, M. Hadjieleftheriou et al., "On multi-column foreign key discovery," VLDB, 2010.
[17] LinkedIn, "WhereHows: A Data Discovery and Lineage Portal," 2016.
[18] Apache, "ATLAS: Data Governance and Metadata framework for Hadoop," http://atlas.incubator.apache.org, 2016.
[19] D. Sarma et al., "Finding Related Tables," in SIGMOD, 2012.
[20] V. M. Megler et al., "Are Data Sets Like Documents?: Evaluating Similarity-Based Ranked Search over Scientific Data," TKDE, 2015.
[21] K. Chen, J. Madhavan, and A. Halevy, "Exploring schema repositories with schemr," 2009.
[22] S. Agrawal, S. Chaudhuri et al., "DBXplorer: Enabling Keyword Search over Relational Databases," in SIGMOD, 2002.
[23] B. Aditya, G. Bhalotia et al., "BANKS: Browsing and Keyword Searching in Relational Databases," in VLDB, 2002.
[24] V. Hristidis and Y. Papakonstantinou, "Discover: Keyword Search in Relational Databases," in VLDB, 2002.
[25] A. Simitsis, G. Koutrika et al., "Précis: from unstructured keywords as queries to structured databases as answers," VLDB, 2008.
[26] Z. Abedjan et al., "DataXFormer: A robust data transformation system," in ICDE, 2016.
[27] T. Bleifuß, S. Kruse et al., "Efficient denial constraint discovery with Hydra," VLDB, 2017.
[28] I. F. Ilyas, V. Markl et al., "CORDS: Automatic Discovery of Correlations and Soft Functional Dependencies," in SIGMOD, 2004.
[29] H. V. Nguyen, E. Müller et al., "Detecting Correlated Columns in Relational Databases with Mixed Data Types," in SSDBM, 2014.