
Aurum: A Data Discovery System

Raul Castro Fernandez, Ziawasch Abedjan#, Famien Koko, Gina Yuan, Sam Madden, Michael Stonebraker
MIT <raulcf, fakoko, gyuan, madden, stonebraker>@csail.mit.edu    # TU Berlin [email protected]

Abstract—Organizations face a data discovery problem when their analysts spend more time looking for relevant data than analyzing it. This problem has become commonplace in modern organizations as: i) data is stored across multiple storage systems, from databases to data lakes to the cloud; ii) data scientists do not operate within the limits of well-defined schemas or a small number of data sources—instead, to answer complex questions they must access data spread across thousands of data sources. To address this problem, we capture relationships between datasets in an enterprise knowledge graph (EKG), which helps users navigate among disparate sources. The contribution of this paper is AURUM, a system to build, maintain and query the EKG. To build the EKG, we introduce a two-step process which scales to large datasets and requires only one pass over the data, avoiding overloading the source systems. To maintain the EKG without re-reading all data every time, we introduce a resource-efficient signature sampling (RESS) method which works by using only a small sample of the data. Finally, to query the EKG, we introduce a collection of composable primitives, thus allowing users to define many different types of discovery queries. We describe our experience using AURUM in three corporate scenarios and do a performance evaluation of each component.

I. INTRODUCTION

With a myriad of data sources spread across multiple heterogeneous databases, modern organizations face a data discovery problem. Analysts spend more time finding relevant data to answer the questions at hand than analyzing it. For example, consider an analyst at a large drug company who is assigned the task of predicting the change in stock price of the company after a presidential election. To build her model, she decides she needs i) the company stock variations in recent elections; ii) the mentions of the company in social media channels; iii) the number of drugs about to be approved by the FDA; and iv) the current productivity of the research department. She needs to find tables with all of these attributes, as well as additional attributes that can be used to link the different drugs, years, and mentions together. The discovery challenge is to find data sources containing this information among the many thousands of tables in the RDBMSs, data lakes and warehouses of the organization. Doing this manually is extremely labor intensive, as the analyst has to browse a large number of files and tables to find those that might match, and then try to find key attributes that can be used to join these tables together. Our work with data analysts at a number of organizations, including Merck, British Telecom, the City of New York, and our university, suggests that they all struggle with such problems on a daily basis.

Although some companies have built in-house solutions to help their analysts find files in their data lakes and databases, these systems are engineered around a customized index and search algorithms to solve a predetermined set of use cases; they do not solve the more general discovery problem presented here. For example, systems such as Goods [1], Infogather [2] and Octopus [3] do not provide a flexible way to support discovery queries different from those they were designed to solve. Analysts want to find datasets according to different criteria, e.g., datasets with a specific schema, datasets similar to a reference one, joinable datasets, etc. To avoid the productivity bottleneck of discovering new data sources, we need a more general structure that organizes the data sources, represents their relationships, and allows flexible querying; otherwise data discovery becomes a bottleneck that hampers productivity within organizations.

This paper is concerned with the design and implementation of a discovery system that permits people to flexibly find relevant data through properties of the datasets or syntactic relationships amongst them, such as similarity of content or schemas, or the existence of primary key/foreign key (PK/FK) links. The key requirements of such a system are: 1) it must work with large volumes of heterogeneous data; 2) it must incrementally adapt to continuously evolving data; and 3) it must provide querying flexibility to support the varied discovery needs of users.

We represent relationships between the datasets in a data structure we call the enterprise knowledge graph (EKG), which analysts use to solve their discovery problems. The contribution of this paper is AURUM, a system to build, maintain and query the EKG:

• Build: Building the EKG requires performing an all-pairs comparison (O(n²) time) between all datasets for each of the relationships we wish to find. With a large number of sources this quadratic cost is infeasible, because of its time and because of the need to read external sources multiple times, thus incurring IO overhead in the external systems. We introduce a two-step process that consists of a profiler, which summarizes all data from the sources into space-efficient signatures by reading data only once, and a graph builder, which finds syntactic relationships such as similarities and PK/FK candidates in O(n) using sketching techniques.

• Maintain: Because data is always changing (req. 2), we must maintain the EKG in a scalable manner: we introduce a resource-efficient signature sampling (RESS) method that avoids repeatedly re-reading source data, while still keeping the EKG up-to-date.

• Query: Finally, because discovery needs are varied, we cannot pre-define queries (req. 3). Instead, we introduce a source retrieval query language (SRQL) based on a set of discovery primitives that compose arbitrarily, allowing users to express complex discovery queries. We implement SRQL on an RDBMS-based execution engine, augmenting it with a graph index, G-Index, which helps speed up expensive discovery path queries. In addition, the query engine allows users to express different ranking criteria for results.

We have used AURUM to solve use cases within companies. We report the results of surveys conducted by our users, as well as performance benchmarks.
II. AURUM OVERVIEW

Here we describe the data discovery problem (II-A), our approach to solving it based on the enterprise knowledge graph (EKG) (II-B), and finish with an overview of AURUM in II-C.

A. The Data Discovery Problem

Consider the set of structured data sources T within an organization. A data source is structured when it is comprised of a set of columns, which may or may not have a label or attribute name, i.e., tabular data such as CSV files or tables in databases, file systems, data lakes, and other sources for which a schema can be obtained. Each data source has a set of attributes or properties, P, and has different relationships, R, with other data sources. Note that R may not be explicitly defined in the data; AURUM discovers these relationships while building the EKG. The sources, T, have different degrees of quality, from highly curated and clean to poorly designed schemas with ambiguous names and many missing values. Data sources can be human- or machine-generated, and may be intended for human or machine consumption, with schema names ranging from highly descriptive to completely obscure.

Finding relevant data. Data discovery is the process of finding a subset S of relevant data sources among the many sources in the organization, T, which are relevant to a user-supplied discovery query. A data source is relevant to a query when it satisfies a constraint or selection criterion C, of which we consider two types.

Property constraints select s ∈ S based on P. For example, selecting columns with unique values, or columns with a string in the schema name, which are all properties of the data. For example, the analyst who is building the stock change prediction model may start with a search for tables that include metrics of relevance, such as stock prices and mentions in social media (schema similarity).

Relationship constraints select s based on R and another set of sources, Sref, which is an input parameter to the selection criteria. In other words, a source, s, is relevant to the user because it is related to another source in Sref. For example, the analyst may be interested in finding datasets similar to the ones found so far to make sure no information is missing (content similarity). Or, having already found a handful of relevant datasets, the analyst may want to find a join path to join them together (a primary-key/foreign-key (PK/FK) candidate).

The challenge. One key challenge of data discovery is that users' needs vary and change over time, meaning that a single discovery workflow is insufficient to address the breadth of discovery tasks, i.e., each user defines relevance according to a very wide set of selection criteria, C. Hence, rather than trying to prescribe a particular set of discovery or relevance queries, we aim to make it possible to answer a wide range of discovery queries by capturing a variety of relationships between datasets, and allowing users to flexibly query those relationships, which we represent in the enterprise knowledge graph (EKG):

Fig. 1. Aurum architecture overview

B. Enterprise Knowledge Graph (EKG)

The EKG is a hypergraph where nodes represent columns of the data sources, edges represent relationships between two nodes, and hyperedges connect any number of nodes that are hierarchically related, such as columns of the same table, or tables of the same source (RDBMS, lake, etc.). In addition, each edge has a weight to indicate the strength of the relationship, e.g., how similar two nodes are, or how confident we are of them being a PK/FK. The nodes represent columns and not individual values because it is more scalable to build and does not compromise the discovery needs we have identified: although finer granularity (e.g., column values) is supported by AURUM, our experience suggests that the additional storage costs are not offset by the gain in expressiveness. The edges connect only two columns to express binary relationships such as content similarity—the values of the columns are similar—schema similarity, the attribute names being similar, or the existence of a PK/FK between them. Hyperedges are necessary to allow users to query at different granularities, e.g., columns and tables. A discovery query, then, filters the nodes and edges in the EKG according to property and relation constraints.
prediction model may start with a search for tables that include
metrics of relevance, such as stock prices and mentions in C. Building, Maintaining, Querying
social media (schema similarity). Our contribution is AURUM, a system to build, maintain
Relationship constraint select s based on R and another set and query the EKG; the three processes are depicted in Fig. 1.
of sources, Sref , which is an input parameter to the selection Build the EKG. (Section III-A) A key goal of building the
criteria. In other words, a source, s, is relevant to the user EKG is to minimize the time-consuming access to underlying
because it is related to another source in Sref . For example, the data sources. To build the EKG in as few passes over the
analyst may be interested in finding similar datasets to the ones data as possible, we separate its construction into two stages
found so far to make sure no information is missing (content (the Two-step process), a signature-building stage and a
similarity). Or, having already found a handful of relevant relationship-building stage. During the first stage, a profiler
datasets, the analyst may want to find a join path to join them (box 3 in the figure) produces signatures that represent the
together (a primary-key/foreign-key (PK/FK) candidate). underlying data and stores them in profiles. These profiles
The challenge. One key challenge of data discovery is that are built by reading the data only-once—we use sketches
users’ needs vary and change over time, meaning that a single as explained in the next section—and form the nodes of the
discovery workflow is insufficient to address the breadth of EKG. During the relationship-building stage, a graph builder
discovery tasks, i.e., each user defines relevance according to component (box 2 in the figure) computes the relationships
Maintain the EKG. (Section III-B) When data changes, we want to keep the EKG up-to-date without recomputing it from scratch and while minimizing the access to the underlying data sources. Our resource-efficient signature sampling (RESS) method determines what data has changed by reading only a sample of the data. AURUM then triggers a re-indexing of these sources, computing their new relationships and updating the EKG.

Query the EKG. (Section IV) To query the EKG, users express property and relation constraints to obtain the data they need through a set of composable primitives. Although the primitives can be implemented in any graph language such as SPARQL, Cypher, etc., we identified a few additional requirements that justified a separate implementation. Ranking results, access to provenance, debugging, and human-latency path query execution all justify the design of an execution engine for SRQL, which is the name of the data discovery language we implement in AURUM. SRQL consists of the discovery primitives, a discovery result set (DRS) to support the above functionality, as well as an index, G-Index, which we use to speed up path queries.

III. BUILDING AND MAINTAINING THE EKG

In this section, we explain how to build the EKG (III-A) and how to maintain it (III-B). Last, we discuss how the EKG evolves in the broader context of an enterprise in III-C.

Why is building the EKG difficult? To build the EKG, a naive process would need to read a column in a table of a data source, and then iterate over all the other columns of all the other tables, each time comparing both the values—to compute content similarity and PK/FK candidates—as well as the names, to compute schema similarity. This process is clearly infeasible because it requires multiple access rounds to the data (heavy I/O) and a quadratic number of operations in the number of columns (heavy CPU). So even when the resulting graph may fit in memory, its construction is expensive because it involves an all-pairs comparison for each relationship of interest (we consider here content similarity, schema similarity, and candidate PK/FK), and because it incurs too much IO overhead on the external source systems.

The key to building an EKG efficiently and minimizing IO overhead is to avoid the all-pairs comparison. We achieve this by using a two-step process. During the first step the process uses sketching techniques to summarize the data in one pass only, reducing IO overhead. During the second step, it employs locality-sensitive hashing (LSH) techniques to transform the all-pairs problem into an approximate nearest neighbor one.

A. Two-Step Process: Building the EKG

Our two-step process divides the EKG building process into a signature-building stage and a relationship-building one. Both are shown in Algorithm 1. The signature-building stage (step 1) is carried out by a profiler, which summarizes the data into profiles that contain signatures with enough information to compute the syntactic relationships. The relationship-building stage (step 2) is carried out by the graph builder, which uses the profiles to compute the syntactic relationships that populate the EKG in O(n) time.

Algorithm 1: Two-Step Process to build the EKG
    Input : C, collection of columns from all databases and tables;
            store, a store to save the profiles;
            sim_thresholds, a list of similarity thresholds
    Output: H = (V, E), the EKG, where V is the set of nodes and
            E = (e_r | r ∈ R), with R the set of syntactic relationships
    // Step 1: signature-building stage
    1  for c ∈ C do
    2      profile ← compute_profile(c)
    3      store[c] ← profile
    4      add_node(H, profile)   // a node is represented with its profile
    // Step 2: relationship-building stage
    5  index_name ← create_indexer(sim_thresholds)
    6  index_content ← create_indexer(sim_thresholds)
    7  for p ∈ store.profiles do
    8      name ← p.name
    9      signature ← p.signature
    10     uniqueness_ratio ← p.uniqueness_ratio
    11     index_name.index(p, name)
    12     index_content.index(p, signature)
    13 for p ∈ store.profiles do
    14     attr_sim_candidates ← index_name.query(p)
    15     add_edges(H, attr_sim_candidates, type="attr_similarity")
    16     content_sim_candidates ← index_content.query(p)
    17     add_edges(H, content_sim_candidates, "content_similarity")
    18     for candidate ∈ content_sim_candidates do
    19         if pkfk_candidate = is_pkfk_candidate(candidate, uniqueness_ratio) then
    20             add_edges(H, pkfk_candidate, "pkfk_candidate")
    21 return H

1) Signature-Building Stage: Profiler: The profiler summarizes each data source (column in tables) into a profile, which maintains information such as content sketches (MinHash), cardinality, data distributions, types, etc. The profiler must be: i) scalable, because of the sheer amount of data it must analyze; and ii) IO-efficient, because it interacts directly with data sources that may be used by other production systems. Our profiler builds the profiles by reading data only once and scales to large amounts of data by using data parallelism.

The profiler consists of a pipeline of stages which are wrapped up in a function compute_profile() (see line 2 in Algorithm 1). Each stage in the pipeline computes a part of the profile for each column. There are two special stages, source and sink. The source is used to read data from files, RDBMS, etc., and provides the input to the compute_profile() function. The sink stores the computed profiles (line 3), so that they are accessible to the graph builder during the second stage of the building process.

Those operations that are shared across profiler stages, such as hashing, are placed at the beginning of the pipeline to save computation downstream.

Parallelism. The profiler exploits parallelism at three different levels. First, multiple profilers can run in a distributed fashion, each processing a disjoint set of data sources. Second, the profiler supports inter-task parallelism, assigning an instance of the processing pipeline to one thread and running multiple such threads on each machine. It also supports intra-task parallelism, with a single pipeline instance per machine but with multiple threads assigned to one pipeline or stage.

Task grain. A natural choice is to assign one pipeline per data source (i.e., file or table). This allows for efficient sequential reads from the data source. In addition, it reduces the complexity of the profiler because a task corresponds directly to a table. Unfortunately, this design leads to under-utilization of processing resources, because table sizes in real datasets are highly skewed. This leads to long-running stragglers that hamper full utilization of the available hardware.

A more efficient approach can be achieved with finer-grain tasks, by partitioning data at the column level. With this approach, the I/O threads that read the data are decoupled from the processing stages through a queue. A task creator component partitions each column into subtasks, which are processed by the processing stages. Each thread keeps track of the partial profile computed for each subtask. When the thread finishes processing a subtask, it sends the partial profile to a merger component, which is responsible for merging all the subtasks for each column and creating the final profile. This design is robust to data skew and achieves full utilization of the available hardware, so it is our preferred approach.

2) Relationship-building Stage: Graph Builder: The graph builder computes the syntactic relationships between the columns in the database using the profiles created in the signature-building stage. The idea is that a relationship between two profiles reflects a relationship between the underlying data. The main problem the graph builder solves is to compute such relationships in O(n), with n the number of columns, avoiding the cost of an all-pairs comparison, for which it uses locality-sensitive hashing [4].

Building EKG syntactic relationships. We want to create a relationship between two nodes of the EKG (e.g., columns) if their signatures are similar when indexed in LSH, i.e., they hash into the same bucket. Signatures can be similar according to two criteria, Jaccard similarity—for which we use the MinHash signature—or cosine similarity, with a TF-IDF signature. We can then differentiate both similarity criteria as two different relationships in the EKG, such as MinHash and TF-IDF, but we will only talk about one to simplify the presentation (signature in Algorithm 1, see line 9). As these signatures are computed in the profiling stage, the graph builder can access them directly from the store (line 7). None of the relationships computed at this point are binary; they all have some associated score that indicates how similar two columns are, how likely two columns are a PK/FK, etc. Such a score, the relationship strength, allows users to refine their queries by, for example, using a threshold that all results must surpass. We explain next how we compute this score.

Relationship strength score. Each edge in the EKG has a weight that indicates the strength of the relationship, e.g., how similar a column is to another. LSH, however, will return elements that are similar based on a pre-determined, fixed threshold. To obtain an approximation of the relationship strength when using LSH, we create an indexer structure that internally indexes the same signature multiple times, in multiple LSH indexes configured with different similarity thresholds, and configured to balance the probability of false positives and negatives. We create such objects in lines 5 and 6 of Algorithm 1, passing as parameters a list of configurable thresholds. Both name and content signatures are indexed, in lines 11 and 12 respectively. The next step is then to iterate a second time over the profiles (line 13), querying the indexer objects (lines 14 and 16) and retrieving candidates. When querying the indexer objects, they internally iterate over the multiple LSH indexes, starting with the one configured with the highest similarity threshold. As the indexer objects return candidates together with their weights (the similarity threshold of the LSH index), they keep the candidates in internal state, so that they can filter them out when they appear in LSH indexes with smaller thresholds. Thus the indexer avoids duplicates and obtains an approximate weight for the relationship, which is then materialized in the EKG (lines 15 and 17).

Alternative LSH techniques: The above description uses a collection of traditional LSH indexes. Since it must use several indexes for different thresholds, the storage needs increase with the number of thresholds desired. Although in our deployments storage overhead is not a problem, there are alternative LSH indexes which help with reducing such overhead, such as LSH Forest [5] and MultiProbe-LSH [6]. Integrating these indexes in AURUM is straightforward.

Approximate PK/FK relationships. To compute PK/FK-candidate relationships, we first identify whether a column is an approximate key, which we do by measuring the ratio of unique values divided by the number of total values. The profiler computes this uniqueness ratio during the first stage, and the graph builder retrieves it (line 10 of the algorithm). A true primary key will have a uniqueness ratio of 1, but because the profiler uses the HyperLogLog sketch to compute it, we may have a small error rate, so we simply check that the value is close to 1. When we retrieve the content-similar candidates, we iterate over them (line 18) and check whether they are PK/FK candidates (line 19), in which case we add the candidate PK/FK relationship to the EKG (line 20). This method is similar to PowerPivot's approach [7], and works well in practice, as shown in the evaluation.
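As a hedged illustration of the two-step idea (not AURUM's implementation), the sketch below profiles columns with MinHash signatures and then indexes each signature in several LSH indexes at different thresholds to retrieve content-similarity candidates with an approximate weight. It assumes the open-source datasketch library and unique column names; all function names are ours:

    from datasketch import MinHash, MinHashLSH

    def compute_profile(name, values, num_perm=128):
        """Step 1: one pass over a column -> profile with a MinHash signature."""
        mh = MinHash(num_perm=num_perm)
        unique = set()
        for v in values:
            unique.add(v)
            mh.update(str(v).encode("utf8"))
        return {"name": name, "signature": mh,
                "num_unique_values": len(unique),
                "uniqueness_ratio": len(unique) / max(len(values), 1)}

    def build_content_edges(profiles, thresholds=(0.9, 0.7, 0.5), num_perm=128):
        """Step 2: query LSH indexes from high to low threshold; the first index
        that returns a pair gives its approximate relationship weight."""
        indexes = {t: MinHashLSH(threshold=t, num_perm=num_perm) for t in thresholds}
        for p in profiles:
            for lsh in indexes.values():
                lsh.insert(p["name"], p["signature"])
        edges, seen = [], set()
        for p in profiles:
            for t in sorted(thresholds, reverse=True):
                for other in indexes[t].query(p["signature"]):
                    if other != p["name"] and (p["name"], other) not in seen:
                        seen.add((p["name"], other))
                        edges.append((p["name"], other, "content_similarity", t))
        return edges

Because each signature is only compared against the LSH buckets it falls into, candidate generation stays close to linear in the number of columns instead of quadratic.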
B. RESS: Maintaining the EKG

We discuss how to maintain the EKG when data changes without re-reading all data from scratch:

Incrementally maintaining profiles. Consider a column, c, for which we compute a MinHash signature, m_t, at time t. At time t+1, we can compute the signature m_t+1. If m_t+1 is not identical to m_t, i.e., if the Jaccard similarity is not 1, we know the data has changed, and, by obtaining the magnitude of the change, i.e., the distance 1 − JS(m_t, m_t+1) between the signatures, we can determine whether it is necessary to recompute the signature and update the EKG, by checking whether this difference is larger than a given threshold. The challenge of maintaining the EKG is to detect such changes without computing m_t+1, because computing it means we need to read the entire column again.

Algorithm 2: Resource-Efficient Signature Sampling
    Input : C, collection of columns from all databases;
            store, a store to retrieve the profiles;
            γ, the maximum magnitude of change tolerable before triggering a request to recompute a profile
    Output: M, where M ⊆ C is the set of columns that need a new profile
    1  for c ∈ C do
    2      s ← random_sample(c)
    3      x ← store[c].num_unique_values
    4      JS_max = |s|/x
    5      s_mh ← minhash(s)
    6      x_mh ← store[c].content_signature
    7      JS' = jaccard_sim(s_mh, x_mh)
    8      δ = 1 − JS'/JS_max
    9      if δ > γ then
    10         trigger_recompute(M, c)
    11 return M

Using our resource-efficient signature sampling (RESS) method, given in Algorithm 2, we compute an estimate of the magnitude of change by using only a sample, s, of c, instead of the entire data. To do this, we assume that the data has not changed, i.e., that the MinHashes m_t are identical to those at m_t+1, indicating that the Jaccard similarity of the old and new columns is 1. We then try to reject this assumption. The Jaccard similarity of two columns, or sets, is expressed as their intersection divided by their union. We observe that, if we have the cardinalities of the columns available (we have the number of unique values at time t, and because we assume the column has not changed, the number of unique values remains unchanged at t+1), then we know that the maximum Jaccard similarity, JS_max, can be expressed as:

    JS = |x ∩ y| / |x ∪ y|;    JS_max = min(|x|, |y|) / max(|x|, |y|)

Note that when |x| = |y|, the intersection and the union are the same, so JS_max = 1. Using this observation, and the fact that the number of unique values of a sample, s, is always lower than that of the original data, c, we expect that, after sampling, JS_max is given by JS_max = |s|/x, where x is the number of unique values of c, which can be obtained from the previously computed profile. This expression gives us the maximum Jaccard similarity based on the sample.

The RESS algorithm operates as follows. We obtain a sample of c (line 2), proportional to the percentage of the number of unique values for this column (line 3). Then we obtain JS_max under the assumption that the data did not change, using the observation above (line 4). We now compute the MinHash signature of s, s_mh (line 5), and retrieve the MinHash signature of c (line 6), which is already available because the profiler computes it to estimate the content-similarity relationship. The algorithm then computes the new estimate of the Jaccard similarity (line 7) and scales it to a distance between 0 and 1 (line 8), so that it can be compared with the maximum tolerable magnitude of change, γ, an input parameter to the algorithm. If the estimated change is above γ (line 9), then the algorithm indicates that it is necessary to recompute c (line 10).

When sampling, RESS relies on the sampling mechanism of the underlying source, although, when possible, it is easy to add a connector with sampling features.
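The following sketch mirrors the logic of Algorithm 2 under the same assumptions as the previous snippet (profiles that store a MinHash signature and a unique-value count); random.sample stands in for the source's own sampling mechanism, and the helper name is ours:

    import random
    from datasketch import MinHash

    def needs_recompute(column_values, profile, gamma, sample_frac=0.1, num_perm=128):
        """Estimate the magnitude of change of a column from a sample only."""
        k = min(len(column_values), max(1, int(len(column_values) * sample_frac)))
        sample = random.sample(column_values, k)
        x = profile["num_unique_values"]                 # cardinality stored at time t
        js_max = min(len(set(sample)) / x, 1.0)          # max Jaccard achievable by the sample
        s_mh = MinHash(num_perm=num_perm)
        for v in sample:
            s_mh.update(str(v).encode("utf8"))
        js_est = s_mh.jaccard(profile["signature"])      # estimate against the stored signature
        delta = 1.0 - js_est / js_max                    # scaled magnitude of change
        return delta > gamma                             # above tolerance -> re-profile the column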
C. Additional EKG Features

In addition to changing data, other factors contribute to the continuous evolution of the EKG. We summarize them next:

• Offline component: Syntactic relationships computed with approximate methods, such as the ones explained above, will contain false positives. It is possible to run an offline component that checks the exact measurements for the relationships existing in the EKG, therefore filtering out the false positives and increasing the precision of the existing ones. Because this component must check only a subset of all pairs of columns, its IO demand is lower. There are two more scenarios which benefit from the offline component: i) users require exact relationships, i.e., no loss in recall; ii) users want a syntactic relationship which is not compatible with LSH, and therefore must be computed exhaustively.

• User Feedback: Along with the SRQL query language (presented next) we incorporate a set of metadata utilities that allow users to annotate existing relationships and nodes in the EKG, as well as manually create and remove relationships from the EKG. In practice, users cannot directly modify the EKG (unless given specific permissions), but their annotations are visible to other users. Users with the right permissions can later materialize users' feedback, thus modifying the EKG.

IV. QUERYING THE EKG WITH SRQL

In this section we present the SRQL language (IV-A) and its implementation (IV-B).

A. The Source Retrieval Query Language

In this section we give a formal definition of the language primitives of SRQL in IV-A1, followed by a running example in IV-A2, and conclude with an in-depth discussion of the Discovery Result Set (DRS) in IV-A3.

1) SRQL Language: The SRQL language consists of two concepts: discoverable elements (DE), which are the nodes of the EKG, and discovery primitives (DP), which are functions used to find DEs. DPs are divided into two groups: pure and lookup. Pure DPs are closed on the DE set, that is, they receive a set of DEs as input and return a set of DEs as output. Lookup primitives also return a set of DEs, but receive a string as input. A SRQL query typically consists of a combination of pure DPs, sometimes preceded by lookup DPs. To deal with important aspects of the SRQL language such as result navigation and ranking, as well as debugging, the SRQL language implementation uses the concept of a discovery result set (DRS), which wraps up the output DEs with additional metadata used to support additional functionality.

2) Running Example: Here we show how an analyst can use SRQL in the different stages of discovery via an example. Suppose a bonds analyst finds an inconsistent value for profits, reported in last quarter's sales report. The analyst wants to verify the hypothesis that the error was caused by using two redundant copies of a table that diverged over time.

1. Broad search of related sources. Retrieve tables with columns that contain the keywords "Sales", "Profits":

    results = schemaSearch("Sales", "Profits")

The schemaSearch primitive finds columns across all databases that contain the input keywords. Other lookup primitives are available to search, e.g., values. All SRQL queries return a discovery result set (DRS) object, which permits inspecting the results as columns, tables, or paths.

When inspecting the tables in the result, the analyst sees a "Profits_3Term_17" table and decides to inspect its schema:

    results = columns("Profits_3Term_17")

columns shows the schema of the table, which is Tx_ID, product, qty, profit; it seems relevant. Other primitives would allow inspecting the tables instead of the columns. The next step is to find tables similar to "Profits_3Term_17" that may have caused the error in the report.

2. Searching for similar content. SRQL provides several primitives to explore the different syntactic relationships:

    contentSim(table: str) =
        drs = columns(table)
        return jaccardContentSim(drs) OR cosineSim(drs) OR rangeSim(drs)
    results = contentSim("Profits_3Term_17")

For composability, all SRQL primitives accept columns unless explicitly changed. The first operation in contentSim is to apply columns to the input table. Next, each of the three primitives, jaccardContentSim, cosineSim, and rangeSim, applies its logic to each input column individually; the results per column are then combined. Then, an OR primitive is used to combine the results of the primitives—the analyst is interested in any relationship that indicates similarity. SRQL has other primitives to combine results with other semantics, such as AND and SET-DIFFERENCE. All primitives operate at the column level. If what is desired is to obtain the intersection or union of tables, then the input DRS can be configured to provide this, as we show later.

In summary, the query returns columns from the different sources in the organization that are similar—according to any of the primitives used—to any column in the input table. To narrow down the result, the analyst refines the query to find tables with similar content and also similar schemas:

    match(columns: drs) = contentSim(columns) AND attributeNameSim(columns)
    table = "Profits_3Term_17"
    results = match(columns(table))

The primitive attributeNameSim returns columns with names similar to the input columns. Combining it with the AND primitive yields the intersection of both queries: i.e., only columns that are content- and attributeName-similar. By adding the attributeNameSim primitive, the results improve, but there are still too many: presumably there are many sources in the databases with columns that have content and attribute names similar to Tx_ID—which seems to be an ID—but with different semantics.

3. Incorporating additional intuition. Semantic ambiguity is a fact of life in data discovery and integration applications. SRQL users can write queries that limit ambiguity in some cases. For example, the analyst may know some values in the original table that should also appear in a potential copy, such as LIBORAdjustedBond, which is one product the bond analyst knows about. This value should appear in tables that refer to products, differentiating them from other tables that also contain a Tx_ID field. This additional intuition can be used to refine the previous query with the aim of reducing the ambiguity of the results, as follows:

    matchAndDisambig(columns: drs, value: str) =
        table(match(columns)) AND table(valueSearch(value))
    value = "LIBORAdjustedBond"
    res = matchAndDisambig(columns("Profits_3Term_17"), value)

The analyst wants tables with columns similar to those of the input table that also contain the value "LIBORAdjustedBond". To avoid requiring both conditions to apply to a single column, it is possible to configure AND to take the intersection of the results at the table level and not at the column level, by using the primitive table on the inputs to AND. Unfortunately, many results match the query, making it difficult to navigate the results. Fortunately, the analyst can rank the results:

4. Ranking the results. To find candidate copies, the analyst wants to see the tables that contain the highest number of columns similar to a column in the input table. Fortunately, DRS objects (which are explained in the next section) can be sorted according to a number of ranking options (developers can add new options), and are ranked by a default policy. In this case, however, the analyst chooses to rank the results of the last query according to coverage, i.e., tables having the largest overlap with the source table. This makes the cause of the misleading value in profits apparent: a file named "ProfitsSecondTerm17.csv"—with the same columns and very similar content—appears in the output. Maybe a user copied the table from the database to do some analysis, and subsequent updates made this copy of the table diverge. The wrong table was used for the report, leading to the initial error.

5. Recovering missed information. After more inspection, the analyst realizes that the file contains values that do not appear in the original table: removing the copy would potentially delete data from the organization. The analyst thinks that if the data in the file comes from other sources, it may be possible to recover that information from the origin sources that contribute values to these tables. To verify this, the analyst wants to know whether there is a join path between the table and the file, with the following SRQL query:

    t1 = "Profits_3Term_17", t2 = "ProfitsThirdTerm17.csv"
    res = path(table(t1), table(t2), pkfkCandidate(), hops=3)

The path primitive finds paths between the first and second arguments if they are transitively connected with the relation represented by the primitive in the third argument. In this case, if both tables are connected with a candidate PK/FK relationship—which exposes candidate primary-key/foreign-key relations—in fewer than 3 hops, then the path will appear in the output DRS. Similar to the OR and AND primitives, path can interpret the input DRS as columns or tables. In this case both input DRS objects are configured as tables. To access the paths, the DRS provides a third convenient view that we mentioned above, paths, which contains all possible 3-hop paths between t1 and t2 connected with PK/FK candidates.
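To illustrate the table-versus-column granularity used by the combiners above, here is a small, illustrative sketch (not SRQL's implementation), where a DRS is reduced to a set of (table, column) pairs:

    def tables(drs):
        """Switch granularity: keep one entry per table."""
        return {(t, None) for (t, _c) in drs}

    def AND(a, b):
        return a & b     # intersection at the current granularity

    def OR(a, b):
        return a | b     # union at the current granularity

    # Column-level AND keeps only columns matched by both sub-queries:
    content_sim = {("Profits_3Term_17", "product"), ("ProfitsSecondTerm17.csv", "product")}
    name_sim    = {("Profits_3Term_17", "profit"),  ("ProfitsSecondTerm17.csv", "product")}
    print(AND(content_sim, name_sim))                  # only the pair matched by both
    # Table-level AND keeps tables that match both, even through different columns:
    print(AND(tables(content_sim), tables(name_sim)))

At column granularity the AND keeps only columns matched by both sub-queries, while switching both inputs to table granularity keeps tables that satisfy both conditions through possibly different columns, which is what matchAndDisambig relies on.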
Without SRQL, the analyst would have spent large amounts of time asking other people and manually exploring the myriad of sources. AURUM simplifies the discovery task and executes the SRQL queries at human-scale latencies by using the EKG. We describe the role of the DRS object in more detail next:

3) Discovery Result Set: If we trace the execution of a SRQL query in the EKG, we obtain a directed graph from the input DEs of the query to the output DEs. The discovery result set (DRS) keeps track of this provenance graph, which we use to answer SRQL queries, navigate the results at different granularities, and rank results on-the-fly. Although in theory the provenance graph could be as large as the EKG, in practice its size is much smaller. For example, none of the queries we used in our evaluation section required more than 500MB of memory (section V-B). We explain next how we build and use the provenance graph:

Recording Provenance. Recording provenance is crucial for ranking query results. It consists of storing how the SRQL query traversed the EKG from the input DEs to produce the output data. The provenance is thus a directed subgraph of the EKG that indicates a path from input to output DEs.

The provenance graph has source nodes, which are the input DEs, or virtual nodes to indicate the input parameters of a lookup primitive, and sink nodes, which correspond to the output. Every discovery primitive modifies this graph deterministically. Primitives that query relationships add downstream DEs to existing nodes in the provenance graph—the neighbors that satisfy the constraint or the DEs within the same hierarchy. Path queries add entire chains of nodes, one per DE involved in the path. Finally, set primitives create joins in the graph: they take two DEs as input and produce one.

Obtaining SRQL query results. Relationship primitives are answered by returning the sink nodes of the provenance graph. Path primitives must return paths that connect a set of source DEs with target DEs; these can be answered by obtaining those paths from the DRS. The DRS knows the granularity at which the DEs it represents must be interpreted; the columns and tables primitives change this granularity.

Explaining SRQL results. The provenance graph is sufficient to answer why and how provenance questions, as defined in the literature [8]. It is possible to identify which input DEs determine the existence of a DE in the output by asking a why query on the DRS. Asking how will return the entire path, from source DEs to sink DEs, along with the primitives that led to its existence. This feature is practically useful for debugging purposes—when writing a new SRQL query—and helpful to shed light on the results of complex SRQL queries.

B. SRQL Execution Engine

We have two requirements for our SRQL execution engine. First, queries must run efficiently. Despite having access to the EKG—which means relationships are pre-computed and materialized in the graph—SRQL queries become expensive when they involve path queries or large combinations of other primitives. We explain our solution, based on a G-Index, in section IV-B1. Second, the engine must support ranking results in a flexible manner. For that, we explain our on-the-fly ranking in section IV-B2.

1) G-Index: Building the SRQL execution engine on top of a relational database has the benefit of having the indexing capabilities already available to answer edge primitives, as well as the chance to store properties on both edges and nodes by using the relational model. Executing path primitives, however, can become slow. A path primitive in SRQL returns all the paths between a source and a target DE. The results of the primitive can be limited to paths with a configurable maximum number of hops. These kinds of graph traversal queries can be expressed in a relational database via self-joins on a relation with source and target node attributes. In practice, however, even when both attributes have indexes, the query can become slow, so we have built the G-Index to speed up path primitives.

The G-Index is a space-efficient in-memory representation of the EKG. It is a map, int -> (int, bitset), in which the nodes of the EKG are represented by their id as an int, and map to tuples of (int, bitset), in which the first component is the id of the target node and the bitset stores the edge information. Specifically, each bit with value 1 in the bitset indicates a different kind of relationship between the nodes—the EKG presented in this paper then requires only 3 bits for the 3 relationships we explain.

The G-Index can be constructed incrementally along with the EKG. In case of a shutdown, it can be efficiently reconstructed from scratch, as we report in the evaluation section. Last, no concurrent writes and reads occur in the structure, which is frozen to reads while being constructed. To support evolution of the EKG, we can answer queries using an old copy of the EKG while building the new one.
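The following Python sketch renders the int -> (int, bitset) idea with one bit per relationship type and a bounded-hop path search; it is an illustration of the design rather than the actual C++ implementation, and all names are ours:

    # adjacency as node_id -> {target_id: bitmask}, one bit per relationship type
    CONTENT_SIM, SCHEMA_SIM, PKFK = 1 << 0, 1 << 1, 1 << 2   # 3 bits for 3 relationships

    class GIndex:
        def __init__(self):
            self.adj = {}                      # node_id -> {target_id: bitmask}

        def add_edge(self, src, dst, rel_bit):
            for a, b in ((src, dst), (dst, src)):
                self.adj.setdefault(a, {})
                self.adj[a][b] = self.adj[a].get(b, 0) | rel_bit

        def paths(self, src, dst, rel_bit, max_hops=3):
            """All simple paths from src to dst using only edges that carry rel_bit."""
            out, stack = [], [(src, [src])]
            while stack:
                node, path = stack.pop()
                if node == dst:
                    out.append(path)
                    continue
                if len(path) > max_hops:       # would exceed the hop budget if expanded
                    continue
                for nxt, bits in self.adj.get(node, {}).items():
                    if bits & rel_bit and nxt not in path:
                        stack.append((nxt, path + [nxt]))
            return out

Packing the edge types into a small bitmask keeps the in-memory footprint low and lets a path query mask out edges of the wrong relationship type without touching the relational store.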
2) On-the-Fly Ranking: Every SRQL query returns a set of DEs, which are wrapped up in a DRS object, as we have seen above. Before presenting the results to users, the DRS ranks the results based on a default policy. Despite the default behavior, we want to allow power users to select the most appropriate ranking criteria for the discovery problem they have, and we want to avoid re-running queries when users change their criteria, making it easy to explore different ranking strategies. In general, a ranking strategy sorts the output results by some measure of relevance with respect to some input query or data. In the context of SRQL, this means a ranking criterion sorts the output DEs based on the input DEs. The provenance graph is key to enabling this functionality. Because it holds the results, it enables decoupling query execution from result ranking. That is, by default any SRQL query returns an unordered set of DEs, and then a ranking strategy ranks them according to a policy; this is analogous to the ORDER BY operator in SQL. The provenance graph is sufficient to express the ranking strategies that we have found in practice, and it serves as a framework to write new strategies when necessary. Next, we describe two criteria for ranking results:

Examples of Ranking Strategies. The EKG edges, and by extension the provenance graph edges, have a weight that indicates the strength of each relationship. One useful way of ranking results is to use these weights, traversing the provenance graph from sinks to sources and aggregating the weights we find; when reaching a join, we select the highest weight of the upstream connections and follow that path. We call this certainty ranking, and we set it as the default strategy. It is trivial to aggregate these scores per table to rank tables instead of columns.

A more sophisticated ranking strategy for SRQL queries that return tables as output is coverage ranking, where output tables of a query are ranked with respect to the input DEs. One example is showing which output table has the highest number of columns that are similar to columns in an input table. For example, if we find two tables that are content-similar to a table of interest, we may be interested in knowing how many of the columns of the newly found tables are similar to the input, and rank the results based on that. The provenance graph keeps the necessary data to allow this.
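As a sketch of these two strategies (assuming a provenance graph stored as child -> [(parent, weight)] and using weight products as one possible aggregation; the paper does not prescribe the aggregation function):

    from functools import lru_cache

    def certainty_scores(sinks, upstream, sources):
        """Rank sinks by the strongest weighted path back to any source node."""
        @lru_cache(maxsize=None)
        def score(node):
            if node in sources:
                return 1.0
            # at a join, follow the highest-weight upstream connection
            return max((w * score(parent) for parent, w in upstream.get(node, [])),
                       default=0.0)
        return sorted(((s, score(s)) for s in sinks), key=lambda x: -x[1])

    def coverage_scores(column_hits):
        """Rank output tables by how many input columns each table matched."""
        # column_hits: output table -> set of input columns it is similar to
        return sorted(((t, len(cols)) for t, cols in column_hits.items()),
                      key=lambda x: -x[1])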
V. EVALUATION

To evaluate the value of AURUM for data discovery, we deployed the system with three collaborators with real data discovery problems and surveyed their experience, as reported in section V-A. We then discuss the performance of AURUM at querying (section V-B), building (section V-C) and maintaining (section V-D) the EKG.

A. Aurum: Data Discovery in Context

We deployed AURUM in 3 corporate scenarios. In each scenario we had to find an aligned interest with the company, develop a use case scenario based on the companies' discovery needs, and then perform the actual deployment, fixing the problems that each new environment brings. Finally, we conducted a survey with the help of a few of the users who had used the system consistently and who had suffered directly from the discovery problems of the company before. First we describe the discovery use cases that AURUM helped to solve (V-A1) and then we report the results of the survey in which we asked them to evaluate the value of AURUM (V-A2).

1) Real Use Cases: University DWH. DWH analysts solve customers' questions by finding relevant data and manually creating views. We used AURUM to automate this process. We deployed AURUM on an Amazon EC2 instance from which the analysts had access to the system.

The analysts highlighted the benefits of AURUM for quickly navigating data and inspecting schemas. The analysts noted a novel use case of detecting copied tables in the DWH. Copies occur because the DWH is built through several ETL processes that integrate data from various places in the university, leading to duplication, which in turn may lead to errors. By writing a SRQL query that finds tables with high content and schema similarity, we found four tables that were candidate copies.

Pharma company. Our pharmaceutical user ran AURUM on 6 different public chemical databases (ChEMBL [9], DrugBank [10], BindingDB [11], Chebi [12], KiDatabase) and an internal database accessed by over 1000 analysts. These databases contain domain-specific information on diverse chemical entities. The users remarked that AURUM helped them identify known entities in unknown databases by using the relationships in the EKG. Although initially only interested in finding join paths, they later became interested in writing variations of schema complement in SRQL, again with the aim of learning about the schemas of other unfamiliar databases.

Sustainability. The sustainability team of our university wanted to enrich their own datasets. They used AURUM to navigate a 250GB repository of CSV files from data.gov. They were interested in finding data about building energy consumption, water use intensity, density of LEED buildings, power plant CO2 emissions, and waste diversion rates—what is diverted from landfills and incinerators. In a 1.5h hands-on session during which we assisted them in using AURUM, they found 2 relevant sources for each of the use cases. When they found semantic ambiguity—data.gov is full of information about the efficiency of diverse appliances—they refined their SRQL queries to avoid such ambiguities, e.g., instead of looking for energy efficiency, combining the query with a requirement for KwH in the schema name, avoiding spurious results. As they explored the datasets and learned how different agencies referred to certain efficiency metrics, they refined their SRQL queries, leading to more accurate datasets. They complemented the handful of relevant datasets they had found with similar content, which they found using AURUM. They highlighted the benefits of quickly exploring and refining SRQL queries.

2) Survey Results: We surveyed 4 users (2 from pharma and 1 from each of the other deployments) that used AURUM. We asked them to rate: i) the usefulness of AURUM's discovery primitives from 0 (not useful) to 5 (very useful); and ii) time savings—how much time they saved with AURUM in comparison with the manual approach, from 0 (no savings) to 5 (very significant savings).

TABLE I. Survey results (scores of 0 to 5) for the lookup, combiner, similarity, and path primitives and for ranking, reporting usefulness and time savings per use case.

The results are in Table I.
TABLE II. Edge and path primitive processing times of different systems on different graphs. Times in milliseconds. For each EKG (four synthetic graphs with 1M/20M, 100K/20M, 100K/80M, and 10K/20M nodes/edges, and the real DWH, ChEMBL, and MassData EKGs), the table reports the average, 95th-percentile, and 99th-percentile runtimes of the edge, 2-hop, and 5-hop path primitives on Neo4J, JanusGraph, Virtuoso, PostgreSQL, and the SRQL G-Index.

All users found the different primitives useful to express their discovery queries. Not all users used all of them, though. For example, in the case of sustainability, no path primitives were necessary. In that case, however, ranking was crucial, as the number of sources was larger than in other use cases. In general, we had a very positive experience working with the users and AURUM.

How do they solve discovery today? We asked them to describe how they solve data discovery problems today, choosing from 6 options (they could select more than one choice). The options were: asking other people or more senior employees where the data is (3 votes), writing SQL queries and using database metadata to understand the content (3 votes), manual work such as writing scripts and visual inspection (3 votes), using some commercial tool to navigate the sources (1 vote), using some open source tool to navigate the sources (2 votes), and other (0 votes).

3) End-to-end deployment: In the next section, we evaluate the performance of each of Aurum's stages in isolation. In all the deployments we have described here, most time is spent during the profiling stage, i.e., the first step of our two-step process. In addition to being the most computationally expensive process, reading from source systems is often limited to avoid cluttering other workloads.

Conclusion. Finally, when asked how likely they are to use AURUM in their organization, three users gave a 4 and one gave a 5. We are currently working with them to achieve a deeper integration in their environments, with the hope of discovering more details and related problems. The experiments so far with AURUM helped us confirm that AURUM is useful for varied discovery needs.

B. Query: SRQL Execution Engine Performance

Next we analyze how the G-Index speeds up discovery path primitives. Our goal is to show that the G-Index on top of a relational database is sufficient to cope with the discovery needs of querying the EKG, and that specialized graph systems that run general graph workloads are not needed. We compare with other specialized graph systems to support this claim. Also note that specialized graph systems do not support some of our requirements, e.g., provenance. We evaluate the performance of the G-Index with 3 queries: a 2-hop and a 5-hop path query, and an edge primitive. We compare the runtime with 4 other state-of-the-art systems.

Datasets. We use 7 different EKGs; three of them (DWH, ChEMBL and MassData) are built on real datasets, and the other 4 are generated following the Erdos-Renyi model for random graphs, which we use to expose the limits of the systems. DWH is a public subset of the data warehouse of our university with around 200 tables. ChEMBL is a well-known public chemical database [9] with around 70 tables. MassData is a repository of open governmental data for the state of Massachusetts, with around 300 tables.

Setup. We compare 5 systems in total, including ours. Before running a system, we spent time tuning it until we achieved the best results. Neo4J [13] is a state-of-the-art single-node graph processing system. We configured the database with indexes to speed up the different queries. All the queries are written in Cypher, Neo4J's language. JanusGraph [14] is an open source version of Titan, a distributed graph database system. We configured it to run on top of Cassandra, because it was the storage backend that yielded the best results. For JanusGraph, we used Gremlin as the query language. Virtuoso is a well-known database for processing RDF data. In the case of Virtuoso, we serialized the EKG as a repository of triples in RDF format, as this was the format in which we found Virtuoso to perform the best. We query the system using SPARQL. Last, we run the queries on standalone PostgreSQL configured with indexes. All systems run on a MacBook Pro with 8GB of memory.

Note that SPARQL in Virtuoso uses an abstraction called property graph—not to be confused with the property graph model [15]—to express path queries. This abstraction does not support path primitives defined on more than one edge type: this means the system is not suitable for SRQL. We included the results for completeness.
Fig. 2. Profiler runtimes for 3 configurations (Index, NoIndex-512, NoIndex-128): running time (s) vs. data size (GB).
Fig. 3. Profiler runtime with two grain sizes (fine-grain and coarse-grain): running time (s) vs. number of threads.
Fig. 4. Graph builder runtimes (MinHash and TF-IDF) with increasing number of DEs: running time (s) vs. #DEs (in thousands).
Results. We report the average, 95th, and 99th percentile runtime over 10 runs for each query in Table II. A cross (✗) indicates that a system ran out of memory or did not finish within 2 minutes.

Edge primitives. SRQL with G-Index is 2 orders of magnitude faster than Neo4J, and 1 order of magnitude faster than Virtuoso—edge queries are efficient in the underlying PostgreSQL. Although the absolute runtime is low (10ms), the runtime multiplies with the total number of input DEs, so it is important to execute these queries efficiently.

The EKG structure has limited impact on the performance of the edge primitives, but it significantly impacts the runtime of path primitives: we ran two discovery path queries, with a maximum of 2 and 5 hops, and present the results next.

2-Hop path primitives. On the real-data EKGs, G-Index is 20x faster than Virtuoso, the second-fastest system. On the synthetic EKGs, G-Index performs better, but on par with Virtuoso and Neo4J. G-Index speeds up short path queries over PostgreSQL (by several orders of magnitude on the real EKGs), while PostgreSQL is competitive with Neo4J and even with Virtuoso in the case of the real EKGs.

5-Hop path primitives. The biggest impact of G-Index shows with the more complex 5-hop path queries. None of the other systems completed all the queries over all the EKGs, while G-Index finished them all. When the other systems did finish, G-Index still performed significantly better; for example, on the real-data EKGs, G-Index is 2 orders of magnitude faster. On the synthetic EKGs, its time is similar to Virtuoso and Neo4J for the least connected EKG, but those systems did not finish the query on any of the other synthetic EKGs.
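For intuition (this is our own rough argument, not a claim from the evaluation): executed naively in a relational engine, a k-hop path primitive becomes a k-way self-join of the edge table, so with average node degree d the intermediate results can grow on the order of d^k rows. At k = 5, even moderately connected EKGs make this explode, whereas an in-memory index only ever materializes the set of visited nodes.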
Loading time and memory consumption. We measured the loading times of the different systems. These times include creating the necessary indexes, for example, creating the G-Index in the case of our solution. PostgreSQL achieves the fastest loading times, and G-Index adds negligible overhead on top when created incrementally—note that we do not need transactional writes to G-Index because the EKG is written first and read later, so the process is very efficient. Including the incremental creation of G-Index, PostgreSQL takes 42s to load the smallest EKG and 160s to load the largest one. The second-fastest system, Neo4J, took 134s and 669s, respectively. Virtuoso, which creates many indexes to speed up query execution, took 162s and 2234s, which is 14x more time than PostgreSQL. G-Index is an in-memory structure, so after a shutdown the index must be recomputed; the time to recompute G-Index from scratch—as opposed to incrementally—was 20s for the largest EKG. Finally, G-Index's memory footprint is on the order of 200MB for the smallest EKG and below 2GB for the largest. This makes sense: the index, written in C++, only needs to keep an efficient bitmap representing the connections between nodes, which are themselves represented as integers.
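The following is a minimal sketch, in Python for readability, of a bitmap-style adjacency index in the spirit of the description above. The dense integer node ids, the symmetric-edge assumption, and the bounded traversal are our own simplifications, not the C++ implementation.

# Nodes are dense integer ids; each node's neighbors are a bitset
# (a Python int used as a bitmap). Our own illustration only.
class BitmapIndex:
    def __init__(self, num_nodes):
        self.rows = [0] * num_nodes  # rows[i]: bit j set means edge i-j

    def add_edge(self, i, j):
        self.rows[i] |= (1 << j)
        self.rows[j] |= (1 << i)  # EKG relationships are symmetric here

    def k_hop(self, seeds, k):
        """Return the bitmap of nodes reachable from `seeds` in <= k hops."""
        frontier = 0
        for s in seeds:
            frontier |= (1 << s)
        visited = frontier
        for _ in range(k):
            nxt = 0
            i = 0
            remaining = frontier
            while remaining:
                if remaining & 1:
                    nxt |= self.rows[i]
                remaining >>= 1
                i += 1
            frontier = nxt & ~visited   # only newly discovered nodes
            if not frontier:
                break
            visited |= frontier
        return visited

idx = BitmapIndex(4)
idx.add_edge(0, 1); idx.add_edge(1, 2); idx.add_edge(2, 3)
print(bin(idx.k_hop([0], 2)))  # 0b111: nodes 0, 1, 2 within 2 hops of node 0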
Conclusion. With G-Index, discovery primitives, including path queries, execute within human-scale latencies.

C. Build: Two-step Process

We measure the performance of the profiler and graph builder that build the EKG (V-C1) as well as the quality of the links they produce (V-C2).

1) Profiler and Graph Builder efficiency: To evaluate the profiler, we measure the time it takes to index varying amounts of data. We ran 3 configurations of the profiler. In the Index configuration, we ran the profiler and also indexed the data into elasticsearch—necessary to answer lookup primitives. With NoIndex, we only measure profiling time; we ran this configuration with MinHash signatures of 512 permutations (NoIndex-512) and 128 permutations (NoIndex-128). We used the DWH dataset, which we replicated to reach the target input data size while preserving its data skew. We ran these experiments on a machine with 32 cores and an SSD disk.

The results in Fig. 2 show that all modes scale linearly with the input data size. The Index mode takes the longest because the indexing speed of elasticsearch becomes the processing bottleneck. To factor out this bottleneck and evaluate the performance of the profiler we built, we ran the two NoIndex configurations. In the case of NoIndex-512, the limiting factor is computing the signature; with NoIndex-128, which is 4x cheaper to compute than NoIndex-512 due to the reduced number of hashing operations, we become limited by the data deserialization routines we use. Nevertheless, these results are very positive: for example, the profiler took only slightly above 1.5 hours to build signatures for the 250GB of data.gov data that we used in our deployment with the sustainability team.
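The 4x gap between NoIndex-512 and NoIndex-128 follows directly from the MinHash construction, sketched below under our own assumptions (salted hashes standing in for permutations); this is an illustration of the standard technique, not the profiler's actual code.

import hashlib

def minhash_signature(values, k):
    """One minimum per 'permutation'; permutation i is simulated by salting
    the hash with i, so cost grows linearly with k."""
    sig = []
    for i in range(k):
        salt = str(i).encode()
        sig.append(min(
            int.from_bytes(hashlib.md5(salt + str(v).encode()).digest()[:8], "big")
            for v in values
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

col_a = ["boston", "cambridge", "somerville", "medford"]
col_b = ["boston", "cambridge", "somerville", "quincy"]
s_a, s_b = minhash_signature(col_a, 512), minhash_signature(col_b, 512)
print(estimated_jaccard(s_a, s_b))  # close to the true Jaccard of 3/5

Because the inner loop runs once per permutation and per value, the signature cost scales linearly with K, which is why dropping from 512 to 128 permutations cuts the hashing work by roughly 4x.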
What about task granularity? We want to evaluate the benefit of using finer-grain tasks (Fine-Grain) in comparison with creating a task per table (Coarse-Grain). Fig. 3 shows that, for the same dataset used above, Fine-Grain achieves better performance than Coarse-Grain; real datasets are skewed, which justifies the design of our profiler.

Graph Builder. To evaluate the performance of the graph builder, we measure its runtime as a function of the number of input profiles connected by the content-similarity relationship (DEs on the x axis).

The results in Fig. 4 show linear growth of the graph builder when using both MinHash and TF-IDF—as expected. To obtain the TF-IDF signature, the graph builder must read data from disk in order to calculate the IDF across the profiles, which explains its higher runtime. MinHash signatures are created as part of the profiling process, hence their lower runtime.

2) Accuracy Microbenchmarks: Approximate PK/FK relationship. We measure the precision and recall of the candidate PK/FKs generated by AURUM on 4 different datasets for which we had ground truth. The results are in Fig. 7.

For TPC-H, our approach achieved similar quality to that reported by PowerPivot [7] and better than that reported by Randomness [16]. Similarly, our approach achieved good results for the FDB dataset, which consists of financial information from our university. Chembl [9] contains a larger number of declared PK/FKs in its schema. In this case, we used the technique presented by PowerPivot [7] of filtering PK/FK candidates by fields with similar attribute names, which was straightforward to express as a simple SRQL query with 2 primitives (sketched below). After further inspection of the results, we found that some of the candidate PK/FKs found by AURUM are in fact semantically reasonable joins, even though they are not included in the official PK/FK constraints defined by the schema (confirmed by an expert); when they were false positives, users found them easy to ignore. Finally, we evaluated our method on a dataset used by one of our collaborators and accessed by hundreds of analysts daily (Pharma). In this case, both the precision and recall of the relationships we found were above 90%, despite the large number of relationships available.
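The attribute-name filter used for Chembl composes naturally from two primitives: take the PK/FK candidates proposed by the EKG and keep only the pairs whose column names look alike. The sketch below is hypothetical; the primitive names, candidate pairs, and similarity measure are ours, chosen only to show the shape of the composition.

from difflib import SequenceMatcher

def pkfk_candidates():
    # (referencing column, referenced column) pairs proposed by the EKG
    return {("orders.cust_id", "customer.cust_key"),
            ("orders.ship_date", "calendar.day_name")}

def name_similarity(a, b):
    # compare only the attribute part of the fully qualified column name
    return SequenceMatcher(None, a.split(".")[-1], b.split(".")[-1]).ratio()

def filter_by_attribute_name(candidates, threshold=0.6):
    return {(a, b) for a, b in candidates if name_similarity(a, b) >= threshold}

print(filter_by_attribute_name(pkfk_candidates()))
# keeps ('orders.cust_id', 'customer.cust_key'); drops the date/name pair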
Content similarity relationship. We compare the quality of the content similarity relationships generated with MinHash and LSH against ground truth obtained by running an exact all-pairs similarity method. We use 3 datasets (we did not use Pharma because we could not run the expensive all-pairs method in the production environment). The results are in Table III. We do not show the results for TPC-H and FDB, for which our approach achieved 100% precision and recall. Instead, we add the MASSDATA dataset and show results for MinHash with 512 and 128 permutations. For all three datasets we achieve high recall, with 100% in the case of ChEMBL and 88% in the case of MASSDATA. The precision for MASSDATA is also high, while for DWH and ChEMBL we achieved a precision of around 50%—note that we measure exactly those relationships above the configured 0.7 similarity threshold—and we verified that many of the false positives actually have a high similarity score.

TABLE III. ACCURACY RESULTS FOR CONTENT SIMILARITY
Dataset    #Sim. pairs actual (δ = 0.7)    Precision/Recall (K=512)    Precision/Recall (K=128)
DWH        8872                            56%/92%                     66%/71%
Chembl     11                              57%/100%                    42%/100%
MassData   28297                           90%/88%                     89%/91%
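A back-of-the-envelope calculation (ours, not part of the original evaluation) makes this behavior around the threshold plausible. The MinHash estimate of a Jaccard similarity J with K permutations is an average of K agreement indicators, so its standard error is approximately sqrt(J(1 - J)/K); for J = 0.7 this gives about 0.04 with K = 128 and about 0.02 with K = 512. Pairs whose true similarity lies within a couple of standard errors of the 0.7 cutoff can therefore land on either side of it, which is consistent with the observation that most of the reported false positives still have a high similarity score.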
Even when we use the cheaper-to-compute MinHash signature with 128 permutations, we still achieve high-quality results. This is important because computing the MinHash signature takes considerable time during the profiling stage, as we saw in the previous section.

In conclusion, our Two-step process to build the EKG scales linearly with the input data by means of the profiler and graph builder, which compute signatures and relationships in linear time. Despite using approximate methods, the quality of the relationships in the EKG suggests a reasonable approximation trade-off against the expensive exact method.

D. Maintain: RESS

In our final experiment, we evaluate the efficiency of our resource-efficient signature sampling (RESS) method. Our goal is to understand whether we can efficiently detect data changes, so that we can keep the EKG up-to-date at low cost.

Synthetic experiment: In the first experiment (Fig. 5), we created 100 columns with 1K values drawn from a uniform distribution. We then changed 50 of the columns, altering them uniformly at random to varying degrees (the perturbation ratio shown on the x axis). We then ran RESS and report the precision and recall of the method when compared with the real changes. Both metrics are higher the more the data has changed. Even when the data changes little, the recall is very high, and the precision implies that only a small amount of additional data must be re-read for verification. To further evaluate RESS, we ran a second experiment with real data.

Real data: We obtained two versions of the ChEMBL database: version 21 (which we have used in the rest of this evaluation) and version 22. ChEMBL updates its database every few months, adding data to and removing data from the existing relations, so we built an EKG using ChEMBL21 and then changed the underlying data to ChEMBL22, thereby simulating data evolution. At this point, we use the EKG to identify the magnitude of change of the columns in the dataset. We then run RESS with different sample sizes and compute the absolute error between the estimated magnitude of change and the true change, which we compute and use as ground truth (420 of the underlying columns changed). The results in Fig. 6 show the error on the x axis and the percentage of columns on the y axis.

RESS identifies 90% of the modified datasets by reading only 10% of the original data, with a maximum error of less than 15%. When using 25% of the data, the error reduces to 10%; the baseline error is 5%. For the rest of the sample sizes shown in the figure, RESS behaves as expected: the bigger the sample size, the lower the error.
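The following is a minimal sketch of the sample-and-verify loop described above, under our own simplifications: the per-column signature is a plain set and the change estimate is the fraction of sampled values not covered by it, whereas RESS operates on the profiler's signatures. Names and thresholds are illustrative.

import random

def signature(values):
    """Stand-in for the stored per-column signature (here: a plain set)."""
    return set(values)

def estimated_change(old_sig, new_values, sample_ratio=0.10):
    sample = random.sample(new_values, max(1, int(len(new_values) * sample_ratio)))
    unseen = sum(1 for v in sample if v not in old_sig)
    return unseen / len(sample)   # fraction of sampled values not in the old column

def columns_to_reread(columns, old_sigs, threshold=0.05, sample_ratio=0.10):
    flagged = []
    for name, new_values in columns.items():
        if estimated_change(old_sigs[name], new_values, sample_ratio) > threshold:
            flagged.append(name)   # only these columns are re-profiled in full
    return flagged

old = {"city": [f"c{i}" for i in range(1000)]}
new = {"city": [f"c{i}" for i in range(500)] + [f"x{i}" for i in range(500)]}
sigs = {k: signature(v) for k, v in old.items()}
print(columns_to_reread(new, sigs))  # ['city'] with high probability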
[Plots for Figs. 5 and 6 omitted.]
Fig. 5. Precision and recall as a function of change, with a 10% sample (score vs. perturbation ratio).
Fig. 6. Absolute error for different sample sizes (percentage of columns vs. error; samples of 100%, 75%, 50%, 25%, and 10%).
Fig. 7. Effectiveness of the PK/FK discovery:
Dataset   #FKs actual   Precision/Recall
TPC-H     11            100%/81%
FDB       7             100%/100%
Chembl    85            35%/67%
Pharma    431           100%/91%
VI. RELATED WORK

Enterprise search systems. LinkedIn has open sourced WhereHows [17]. The Apache foundation offers Atlas [18] and Google has Goods [1]. All these systems place a strong emphasis on recording the lifecycle of datasets through lineage information, which is orthogonal to the problem we solve in this paper. These systems provide some subset of search or discovery features, such as finding similar datasets or inspecting existing ones, but none of them permits users to change the discovery query on demand.

Exploratory systems. Octopus [3] and Infogather [2] solve the problem of schema complement. Finding related tables [19] focuses on finding tables that are candidates for joins, similar to schema complement, or candidates for union. Finally, in [20] the authors build a system with a particular set of operations that solves use cases in the context of oceanographic data. All the above systems rely on custom indexes, specific to each use case. In contrast, we take a more general approach to the data discovery problem, building an EKG that we use to flexibly answer varied discovery needs and that can then be used as the basis for new, more complex relationships.

IR and databases. There is a plethora of systems to perform keyword search in RDBMSs [21], [22], [23], [24], [25], [26]. Most of these systems can be seen as a constrained implementation of our discovery primitives on an EKG whose DE granularity is individual values. Most of these systems then propose to connect the results through a Steiner tree so that they can rank them appropriately. Since we maintain a provenance graph of the execution of SRQL queries, we could find a Steiner tree in such a graph, therefore offering similar functionality. We have built AURUM for more general discovery cases that cannot be solved with keyword search alone.

Other links for the EKG. We designed Aurum to support relationships useful to discover and connect datasets within organizations. We discussed some syntactic relationships but left many others out. For example, work on finding functional dependencies [27], as well as correlations among columns [28], [29], is complementary to ours.

VII. CONCLUSION

We presented AURUM1, a data discovery system that builds, maintains and allows users to query an enterprise knowledge graph to solve diverse discovery needs. We have used AURUM with several companies; many others are in the process of onboarding the technology, and we have already identified lines of future work, such as the inclusion of other kinds of non-data-driven relations. We see AURUM as a stepping stone towards addressing the substantial challenges that the modern flood of data presents in large organizations.

1 https://github.com/mitdbg/aurum-datadiscovery

Acknowledgements: We thank the many users and industrial collaborators that participated in the user studies we conducted for the evaluation.

REFERENCES

[1] A. Halevy et al., "Goods: Organizing Google's Datasets," in SIGMOD, 2016.
[2] M. Yakout et al., "InfoGather: Entity Augmentation and Attribute Discovery by Holistic Matching with Web Tables," in SIGMOD, 2012.
[3] M. J. Cafarella, A. Halevy et al., "Data Integration for the Relational Web," VLDB, 2009.
[4] A. Andoni and P. Indyk, "Near-optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions," Commun. ACM, 2008.
[5] M. Bawa, T. Condie, and P. Ganesan, "LSH Forest: Self-tuning Indexes for Similarity Search," in WWW, 2005.
[6] Q. Lv, W. Josephson et al., "Multi-probe LSH: Efficient Indexing for High-dimensional Similarity Search," in VLDB, 2007.
[7] Z. Chen, V. Narasayya et al., "Fast Foreign-key Detection in Microsoft SQL Server PowerPivot for Excel," VLDB, 2014.
[8] J. Cheney, L. Chiticariu et al., "Provenance in Databases: Why, How, and Where," Foundations and Trends in Databases, 2009.
[9] E. L. Willighagen, A. Waagmeester et al., "The ChEMBL database as linked open data," Journal of Cheminformatics, 2013.
[10] D. S. Wishart, C. Knox et al., "DrugBank: a knowledgebase for drugs, drug actions and drug targets," Nucleic Acids Research, 2008.
[11] T. Liu et al., "BindingDB: a web-accessible database of experimentally determined protein-ligand binding affinities," NAR, 2007.
[12] J. Hastings et al., "The ChEBI reference database and ontology for biologically relevant chemistry: enhancements for 2013," NAR, 2012.
[13] Neo4J, "Neo4J," https://neo4j.com, 2017.
[14] JanusGraph, "JanusGraph: Distributed graph database," https://janusgraph.org, 2017.
[15] W. Sun, A. Fokoue et al., "SQLGraph: An Efficient Relational-Based Property Graph Store," in SIGMOD, 2015.
[16] M. Zhang, M. Hadjieleftheriou et al., "On multi-column foreign key discovery," VLDB, 2010.
[17] LinkedIn, "WhereHows: A Data Discovery and Lineage Portal," 2016.
[18] Apache, "ATLAS: Data Governance and Metadata framework for Hadoop," http://atlas.incubator.apache.org, 2016.
[19] D. Sarma et al., "Finding Related Tables," in SIGMOD, 2012.
[20] V. M. Megler et al., "Are Data Sets Like Documents?: Evaluating Similarity-Based Ranked Search over Scientific Data," TKDE, 2015.
[21] K. Chen, J. Madhavan, and A. Halevy, "Exploring schema repositories with schemr," 2009.
[22] S. Agrawal, S. Chaudhuri et al., "DBXplorer: Enabling Keyword Search over Relational Databases," in SIGMOD, 2002.
[23] B. Aditya, G. Bhalotia et al., "BANKS: Browsing and Keyword Searching in Relational Databases," in VLDB, 2002.
[24] V. Hristidis and Y. Papakonstantinou, "Discover: Keyword Search in Relational Databases," in VLDB, 2002.
[25] A. Simitsis, G. Koutrika et al., "Précis: from unstructured keywords as queries to structured databases as answers," VLDB, 2008.
[26] Z. Abedjan et al., "DataXFormer: A robust data transformation system," in ICDE, 2016.
[27] T. Bleifuß, S. Kruse et al., "Efficient denial constraint discovery with Hydra," VLDB, 2017.
[28] I. F. Ilyas, V. Markl et al., "CORDS: Automatic Discovery of Correlations and Soft Functional Dependencies," in SIGMOD, 2004.
[29] H. V. Nguyen, E. Müller et al., "Detecting Correlated Columns in Relational Databases with Mixed Data Types," in SSDBM, 2014.