Querying Big Data: Bridging Theory and Practice

Wenfei Fan, Jinpeng Huai
School of Informatics, University of Edinburgh, 10 Crichton Street, Edinburgh EH8 9AB, United Kingdom
Research Center on Big Data, Beihang University, No. 37 XueYuan Road, Beijing 100083, China
Abstract
Big data introduces challenges to query answering, from theory to practice. A number of questions arise. What queries are
tractable on big data? How can we make big data small so that it is feasible to find exact query answers? When exact answers
are beyond reach in practice, what approximation theory can help us strike a balance between the quality of approximate query
answers and the costs of computing such answers? To get sensible query answers from big data, what else must we do in
addition to coping with the size of the data? This position paper aims to provide an overview of recent advances in the study of
querying big data. We propose approaches to tackling these challenging issues, and identify open problems for future research.
Keywords: Big data, query answering, tractability, distributed algorithms, incremental computation, approximation, data quality.
1. Introduction
Big data is a term that is almost as popular as the internet was 20 years ago. It refers to a collection of data sets
so large and complex that it becomes difficult to process using traditional database management tools or data processing
applications [92]. More specifically, big data is often characterized by four Vs: Volume for the scale of the data, Velocity for its streaming or dynamic nature, Variety for its different
forms (heterogeneity), and Veracity for the uncertainty (poor
quality) of the data [64]. Such data comes from social networks (e.g., Facebook, Twitter, Sina Weibo), e-commerce systems (e.g., Amazon, Taobao), finance (e.g., stock transactions),
sensor networks, software logs, e-government and scientific research (e.g., environmental research), just to name a few, where
data is easily of PetaByte (PB, 10^15 bytes) or ExaByte (EB, 10^18 bytes) size. The chances are that big data will have as big an impact on our daily lives as the internet has had.
New challenges. As big data researchers, we do not confine ourselves to the general characterization of big data. We are more interested in what specific technical problems or research issues
big data introduces to query answering. Given a dataset D and
a query Q, query answering is to find the answers Q(D) to Q
in D. Here Q can be an SQL query on relational data, a keyword query to search documents, or a personalized social search
query on social networks (e.g., Graph Search of Facebook [29]).
Example 1. A fraction D0 of an employee dataset of a company is shown in Figure 1. Each tuple in D0 specifies the first name (FN), last name (LN), salary and marital status of an employee, as well as the area code (AC) and city of her office. A
query Q0 is to find distinct employees whose first name is Mary.
Email addresses: [email protected] (Wenfei Fan),
[email protected] (Jinpeng Huai)
Figure 1: An employee dataset D0.

      FN      LN      AC   city     salary  status
t1:   Mary    Smith   20   Beijing  50k     single
t2:   Mary    Webber  10   Beijing  50k     married
t3:   Mary    Webber  10   Beijing  80k     married
s1:   Bob     Luth    212  NYC      80k     married
s2:   Robert  Luth    212  NYC      55k     married
Such a query can be expressed in, e.g., relational algebra as σ_{FN = "Mary"}(R0), using the selection operator σ [1], where R0 is the relation schema of D0. To answer the query Q0 in D0, we need to find all tuples in D0 that satisfy the selection condition FN = "Mary", i.e., tuples t1, t2 and t3.
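To make the example concrete, here is a minimal Python sketch (our own illustration, not code from the paper) that evaluates Q0 over the tuples of Figure 1 as a simple selection:

    # Tuples of Figure 1, represented as dictionaries.
    D0 = [
        {"FN": "Mary",   "LN": "Smith",  "AC": "20",  "city": "Beijing", "salary": "50k", "status": "single"},
        {"FN": "Mary",   "LN": "Webber", "AC": "10",  "city": "Beijing", "salary": "50k", "status": "married"},
        {"FN": "Mary",   "LN": "Webber", "AC": "10",  "city": "Beijing", "salary": "80k", "status": "married"},
        {"FN": "Bob",    "LN": "Luth",   "AC": "212", "city": "NYC",     "salary": "80k", "status": "married"},
        {"FN": "Robert", "LN": "Luth",   "AC": "212", "city": "NYC",     "salary": "55k", "status": "married"},
    ]

    def answer_Q0(D):
        # selection sigma_{FN = "Mary"}: keep the tuples whose first name is Mary
        return [t for t in D if t["FN"] == "Mary"]

    print(answer_Q0(D0))   # returns t1, t2 and t3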
In the context of big data, query answering becomes far more
challenging than what we have seen in Example 1. The new
complications include but are not limited to the following.
Data. In contrast to a single traditional database D0 , there are
typically multiple data sources with information relevant to our
queries. For instance, a recent study shows that many domains
have tens of thousands of Web sources [22], e.g., restaurants,
hotels, schools. Moreover, these data sources often have a large
volume of data (e.g., of PB size) and are frequently updated.
They have different formats and may not come with a schema,
as opposed to structured relational data. Furthermore, many
data sources are unreliable: their data is typically dirty.
Query. Queries posed on big data are no longer limited to our
familiar SQL queries. They are often for document search, social search or even for data analysis such as data mining. Moreover, their semantics also differs from that of traditional queries. On
one hand, it can be more flexible: one may want approximate
answers instead of exact answers Q(D). On the other hand,
one could ask query answering to be ontology-mediated by coupling datasets with a knowledge base [12], or personalized and context-aware [86], such that the same query gets different answers when issued by different people in different locations.

As will be seen in Section 3, however, some queries can be answered by accessing only a bounded amount of data, no matter how large the underlying dataset D is. For instance, when Q is a Boolean conjunctive query (a.k.a. SPC query [1]), we need at most ||Q|| tuples from D to answer Q, independent of the size of D, where ||Q|| is the number of tuples in the tableau representation of Q. This is also the case for many personalized social search queries. Intuitively, if a core DQ of D for answering Q has a bounded size, then Q is scale independent in D [36], i.e., we can efficiently compute Q(D) no matter how big D is. This suggests that we study how to determine whether a query is scale independent in a dataset.
These tell us that query answering in big data is a departure from our familiar terrain of traditional database queries. It
raises a number of questions. Does big data give rise to any new
fundamental problems? In other words, do we need new theory
for querying big data? Do we need to develop new methodology for query processing in the context of big data? What practical techniques could we use to cope with the sheer volume of
big data? In addition to the scalability of query answering algorithms, what else do we have to pursue in order to find sensible
or even correct query answers in big data?
In addition, we develop several practical techniques for making big data small. These include (a) distributed query
processing by partial evaluation [47], with provable performance guarantees on both response time and network traffic;
(b) query-preserving data compression [45]; (c) view-based
query answering [50]; and (d) bounded incremental computation [49, 82]. All these techniques allow us to compute Q(D)
with a cost that is not a function of the size of D, and have
proven effective in querying social networks. The list is not exhaustive: there are many other techniques for making big data
small and hence, making queries feasible on big data.
Moreover, when exact answers are beyond reach in practice, we propose resource-bounded approximation, which accesses only a bounded fraction of D to compute approximate answers to Q, such that Q(DQ) is within a provable performance ratio of the exact answer Q(D). We explore the connection between the resolution and the quality bound, to strike a balance between the computation cost and the quality of the approximate answers. Our preliminary study [52] has shown that for personalized social search queries, the performance ratio remains 100% even when the resolution is as small as 0.0015% (i.e., 1.5 × 10^-5). That is, we can reduce D of 1PB to a DQ of 15GB, while still retaining exact answers for such queries!
Below we present a notion of BD-tractable queries. We encourage the interested reader to consult [38] for details.
Preliminaries. We start with a review of two well-studied complexity classes (see, e.g., [58, 67] for details).
The complexity class P consists of all decision problems that can be solved by a deterministic Turing machine in polynomial time (PTIME), i.e., in n^{O(1)} time, where n is the size of the input (dataset D and query Q in our case). The parallel complexity class NC, known as Nick's Class, consists of all decision problems that can be solved in O(log^{O(1)} n) time on a PRAM (parallel random access machine) with n^{O(1)} processors.
From the example we can see that when the datasets are dirty,
we cannot trust the answers to our queries in those datasets. In
other words, no matter how big a dataset we can handle and
how fast our query processing algorithms are, the query answers computed may not be correct and hence may be useless!
Unfortunately, real-life data is often dirty [33], and the scale of
data quality problems is far worse in the context of big data,
since real-life data sources are often unreliable. Therefore, the
study of the quality of big data is as important as techniques for
coping with its quantity; that is, big data = quantity + quality!
This motivates us to study the quality of big data. We consider five central issues of data quality: data consistency [34],
data accuracy [16], information completeness [32], data currency [40] and entity resolution [31], from theory to practice.
We study how to repair dirty data [20, 43, 46] and how to deduce true values of an entity [39], among other things, emphasizing new challenges introduced by big data.
Organization. The remainder of the paper is organized as follows. We start with BD-tractability in Section 2. We study scale
independence and present several practical techniques for making queries BD-tractable in Section 3. When BD-tractable algorithms for computing exact query answers are beyond reach in
practice, we study approximate query answering in Section 4,
by proposing query-driven approximation and data-driven approximation. We study the other side of big data, namely, data
quality, in Section 5. Finally, Section 6 concludes the paper.
Intuitively, function Π(·) preprocesses D and generates another structure D' = Π(D) offline, in PTIME. After this, for all queries Q ∈ Q that are defined on D, Q(D) can be answered by evaluating Q(D') online in NC, i.e., in parallel polylog-time.
A breadth-depth search starts at a node s and visits all its children, pushing them onto a stack in the reverse order induced by the vertex numbering as the search proceeds. After all of s's children are visited, the search continues with the node on the top of the stack, which plays the role of s.
In the problem statement of BDS given above, the entire input, i.e., x = (G, (u, v)), is treated as a query, while its data part
is empty. In this setting, there is nothing to be preprocessed.
Moreover, it is known that BDS is P-complete (cf. [58]), i.e., it is among the hardest problems in the complexity class P. Unless P =
NC, such a query cannot be processed in parallel polylog-time.
In other words, this class of BDS queries is not in BDT0 unless
P = NC. It is also known that the question whether P = NC is
as hard as our familiar open question whether P = NP.
Nonetheless, there exists a re-factorization (π1, π2, ρ) of its instances x = (G, (u, v)) that identifies G as the data part and (u, v) as the query part. More specifically, π1(x) = G, π2(x) = (u, v), and ρ maps π1(x) and π2(x) back to x. Given this, BDS can be made BD-tractable: we can compute the breadth-depth search order of G once, offline in PTIME, and then answer each query (u, v) in parallel polylog-time (indeed, in constant time) by comparing the positions of u and v in that order; a minimal sketch of this idea is given below.
Open issues. There has been a host of recent work on revising the traditional complexity theory to characterize data-intensive computation on big data. The revisions are defined in
terms of computational costs [38], communication (coordination) rounds [61, 71], or MapReduce steps [69] and data shipments [3] in the MapReduce framework [23]. Our notions of
BD-tractability focus on computational costs [38]. The study is
still preliminary, and a number of questions remain open.
(1) The first question concerns what complexity class precisely
characterizes online query processing that is feasible on big
data. As a starting point we adopt NC because (a) NC is considered highly parallel feasible [58]; (b) parallel polylog-time is
feasible on big data; and (c) many NC algorithms can be implemented in the MapReduce framework [69], which is being used
in cloud computing and data centers for processing big data.
However, NC is defined in the PRAM model, which may not be
accurate for real-life parallel frameworks such as MapReduce.
A form of NC-reductions ≤^NC_fa has been defined for BDT, which is transitive (i.e., if Q1 ≤^NC_fa Q2 and Q2 ≤^NC_fa Q3, then Q1 ≤^NC_fa Q3) and compatible with BDT (i.e., if Q1 ≤^NC_fa Q2 and Q2 is in BDT, then so is Q1). Similarly, NC-reductions have been defined for BDT0 with these properties. In contrast to our familiar PTIME-reductions for NP problems (see, e.g., [81]), these reductions require a pair of NC functions, i.e., both are in parallel polylog-time.
(2) The second question concerns the complexity of preprocessing. Let us use PQ[CP , CQ ] to denote the set of all query classes
that can be answered by preprocessing the data sets in the complexity class CP and subsequently answering the queries in CQ .
Then BDT0 can be represented by PQ[P, NC]. One may consider other complexity classes CP instead of P. For instance,
one may consider PQ[NC, NC] by requiring the preprocessing
step to be conducted more efficiently; this is not very interesting
since PQ[NC, NC] coincides with NC. On the other hand, one
may want to consider CP beyond P, e.g., NP and PSPACE (i.e.,
PQ[NP, NC] and PQ[PSPACE, NC]). This is another debatable
issue that demands further study. No matter what PQ[CP , CQ ]
we use, one has to strike a balance between its expressive power
and computational cost in the context of big data.
These results are not only of theoretical interest, but also provide guidance for answering queries in big data. For instance, given a query class Q, we can conclude that it can be made BD-tractable if we can find a ≤^NC_fa reduction to a complete query class Qm of BDT. If so, we are guaranteed an effective algorithm for answering queries of Q in big data. Indeed, such an algorithm can be developed by simply composing the NC reduction with an NC algorithm for processing Qm queries; the composed algorithm remains in parallel polylog-time.
One may ask what query classes cannot be made BD-tractable. The results above also tell us the following: unless P = NP, all query classes for which the membership problem is NP-hard are not in BDT. Here the membership problem for a query class Q is to decide, given a query Q in Q, a dataset D and a tuple t, whether t belongs to Q(D).
Example 6 ([36]). Some real-life queries are actually scale independent, e.g., (slightly modified) personalized search queries taken from Graph Search of Facebook [29].
The idea is simple. But to implement it, we need to settle several fundamental questions and develop practical techniques.
Below we first study questions concerning whether it is possible at all to find a small dataset D' ⊆ D such that Q(D) = Q(D').
We then present several practical techniques to make big data
small, which have been evaluated by using social network analysis as a testbed, and have proven effective in the application.
3.1. Scale Independence
We start with fundamental problems associated with the approach to making big data small. We first study the existence of a small subset D' of D such that we can answer Q in D by accessing only the data in D'. We then present effective methods for identifying such a D'. We invite the interested reader to consult [36] for a detailed report on this subject.
Following [36], we say that Q is scale independent in D w.r.t. a bound M if there exists a subset DQ of D with at most M tuples such that Q(DQ) = Q(D).
That is, to answer Q in D, we need only to fetch at most M
tuples from D, regardless of how big D is. We refer to DQ as
a core for answering Q in D. Note that DQ may not be unique.
As will be seen shortly, we want to find a minimum core.
One step further, we say that Q is scale independent for R
w.r.t. M if for all instances D of R, Q is scale independent in
D w.r.t. M , i.e., one can always find a core DQ with at most M
tuples for answering Q in D.
For instance, a Boolean conjunctive query Q is scale independent w.r.t. M = ||Q||, where ||Q|| is the number of tuple templates in the tableau representation of the conjunctive query Q [1]. Here Q is Boolean if for any instance D of R, Q(D) returns true if Q(D) is nonempty, and false otherwise.
Example 7. Continuing with Example 6, we would have a tuple (friend, id1 , 5000, T ) for some value T in the access schema
A. That is, there exists an index on id1 such that if id1 is provided, at most 5000 tuples with such an id exist in friend, and it
takes time T to retrieve those. In addition, we would have a tuple (person, id, 1, T ) in A, indicating that id is a key for person
with a known time T for retrieving the tuple for a given id.
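As a toy illustration of how such an access schema can be exploited (names and data structures here are our own, not from [36]): each access constraint is backed by an index that, given a value of the indexed attribute, returns at most N matching tuples, so a query plan built on these indices touches a bounded number of tuples regardless of the size of D.

    from collections import defaultdict

    class AccessConstraint:
        # Models one entry (relation, attribute, N, T) of an access schema:
        # an index on `attribute` that returns at most `bound` tuples per value.
        def __init__(self, relation, attribute, bound):
            self.relation, self.attribute, self.bound = relation, attribute, bound
            self.index = defaultdict(list)

        def load(self, tuples):
            for t in tuples:
                self.index[t[self.attribute]].append(t)

        def fetch(self, value):
            matches = self.index.get(value, [])
            assert len(matches) <= self.bound, "data violates the access schema"
            return matches

    # index on friend(id1) with bound 5000, as in Example 7
    friend_by_id1 = AccessConstraint("friend", "id1", 5000)
    friend_by_id1.load([{"id1": "p1", "id2": "p2"}, {"id1": "p1", "id2": "p3"}])

    # answering a query about p1's friends touches at most 5000 tuples,
    # no matter how large the friend relation is
    print(friend_by_id1.fetch("p1"))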
Computing a core by leveraging access schema. Given a relational schema R, we say that a query Q is scale independent
under access schema A if for all instances D of R that conform
to A, the answer Q(D) can be computed in time that depends
only on A and Q, but not on D. That is, Q is scale independent for R in the presence of A, independent of the size of the
underlying D. Several results on deciding and exploiting scale independence under an access schema are known; see [36].
Graph pattern matching. We start with a review of graph pattern matching in social graphs, which typically represent social
networks, e.g., Facebook, Twitter, LinkedIn.
Distributed query processing with partial evaluation. Distributed query processing is perhaps the most popular approach
to querying big data, notably MapReduce [23]. Here we advocate distributed query processing with partial evaluation.
Partial evaluation has been used in a variety of applications, including compiler generation, code optimization and dataflow evaluation.
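The following Python sketch conveys the flavor of the approach for a simple distributed reachability query (can node u reach node v?) over a graph fragmented across sites. It is our own minimal illustration under simplifying assumptions, not the algorithms of [47]: each site evaluates the query on its local fragment as far as it can (the partial answer), and ships to the coordinator only the border nodes through which the computation must continue.

    from collections import defaultdict, deque

    def local_partial_eval(local_edges, local_nodes, entry_nodes, target):
        # Run at one site. For each entry node, compute a partial answer:
        # (reached the target locally?, set of border nodes owned by other sites).
        adj = defaultdict(list)
        for a, b in local_edges:
            adj[a].append(b)
        partial = {}
        for e in entry_nodes:
            seen, queue, hit, border = {e}, deque([e]), False, set()
            while queue:
                x = queue.popleft()
                if x == target:
                    hit = True
                if x not in local_nodes:        # node owned by another site
                    border.add(x)
                    continue
                for y in adj[x]:
                    if y not in seen:
                        seen.add(y)
                        queue.append(y)
            partial[e] = (hit, border)
        return partial

    def assemble(partials, source, target):
        # Run at the coordinator: `partials` is the union of per-site results,
        # computed for the source and for every target of a cross-edge.
        seen, queue = {source}, deque([source])
        while queue:
            x = queue.popleft()
            hit, border = partials.get(x, (x == target, set()))
            if hit:
                return True
            for y in border:
                if y not in seen:
                    seen.add(y)
                    queue.append(y)
        return False

Roughly speaking, in the algorithms of [47] the partial answers take a more compact form (Boolean equations over the border nodes), which is what yields the provable bounds on response time and network traffic mentioned above.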
This section proposes two approaches to developing approximation algorithms for answering queries in big data, referred
to as query-driven and data-driven approximation.
(1) As we have seen in Section 3.1, access schemas help us determine whether a query is scale independent and if so, develop
an efficient plan to evaluate the query. A practical question is
how to design an optimal access schema for a given query
workload, such that we can answer as many given queries as
possible by accessing a bounded amount of data.
R(x, y) = m(x, y) / opt(x) if A is a minimization problem, and R(x, y) = opt(x) / m(x, y) if A is a maximization problem.

precision(Q, D, Y) = |Y ∩ Q(D)| / |Y|,
recall(Q, D, Y) = |Y ∩ Q(D)| / |Q(D)|,
F-measure = 2 · (precision(Q, D, Y) · recall(Q, D, Y)) / (precision(Q, D, Y) + recall(Q, D, Y)).
For such patterns, we have developed resource-bounded approximation algorithms for graph pattern matching defined in
terms of subgraph isomorphism and graph simulation (see Section 3.2). We have experimented with these algorithms using
real-life social graphs. The results are very encouraging. We
find that our algorithms are efficient: they are 135 and 240 times faster than traditional pattern matching algorithms based on graph simulation and subgraph isomorphism, respectively. Better still, the algorithms are accurate: even when the resource ratio is as small as 0.0015% (i.e., 1.5 × 10^-5), the algorithms return matches with 100% accuracy! Observe that when G consists of 1PB of data, the amount of data accessed is down to 15GB, i.e., resource-bounded approximation truly makes big data small, without paying too high a price in the accuracy of query answers.
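For reference, graph simulation, one of the two pattern matching semantics mentioned above, can be computed by the standard fixpoint refinement below (a plain, unoptimized Python sketch of the classical algorithm, not the resource-bounded algorithms discussed here):

    from collections import defaultdict

    def graph_simulation(pattern_nodes, pattern_edges, graph_nodes, graph_edges):
        # pattern_nodes / graph_nodes: dict node -> label
        # pattern_edges / graph_edges: iterable of (src, dst) pairs
        succ = defaultdict(set)
        for a, b in graph_edges:
            succ[a].add(b)
        # initial candidates: nodes carrying the same label
        sim = {u: {v for v, lv in graph_nodes.items() if lv == lu}
               for u, lu in pattern_nodes.items()}
        changed = True
        while changed:
            changed = False
            for (u, u2) in pattern_edges:
                # v can simulate u only if some successor of v simulates u2
                keep = {v for v in sim[u] if succ[v] & sim[u2]}
                if keep != sim[u]:
                    sim[u], changed = keep, True
        # the pattern matches the graph iff every pattern node has a match
        return sim if all(sim.values()) else None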
In contrast, resource-bounded approximation adopts a dynamic reduction strategy, which finds a small dataset DQ with only the information needed for an input query Q, and hence allows higher accuracy within the given bound on the amount of data accessed.
One can use any techniques for dynamic reduction, including
those for data synopses such as sampling and sketching, as long
as the process visits a bounded amount of data in D.
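A minimal sketch of the dynamic-reduction idea for a personalized search (our own illustration, with a hypothetical budget parameter, not the sampling or sketching methods cited above): starting from the node the query is personalized to, visit at most `budget` nodes of G, and evaluate the query on the visited subgraph DQ only.

    from collections import deque

    def dynamic_reduction(adj, start, budget):
        # Visit at most `budget` nodes of G breadth-first from `start` and
        # return the visited subgraph DQ (its nodes and the edges among them).
        nodes, queue = {start}, deque([start])
        while queue:
            v = queue.popleft()
            for w in adj.get(v, []):
                if w not in nodes and len(nodes) < budget:
                    nodes.add(w)
                    queue.append(w)
        edges = [(v, w) for v in nodes for w in adj.get(v, []) if w in nodes]
        return nodes, edges

    # the query Q is then evaluated on DQ instead of on the full graph G,
    # e.g., "friends of `start` who live in Beijing"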
(3) The third topic is to develop resource-bounded approximation algorithms in various application domains. For instance,
for social searches that are not personalized, i.e., when no nodes
in a graph pattern are designated to map to fixed nodes in a social graph G, can we develop effective resource-bounded approximation algorithms for graph pattern matching?
(4) Finally, approximation classes for resource-bounded approximation need to be defined, along the same lines as their counterparts for traditional approximation algorithms (e.g., APX, PTAS, FPTAS [21]). Similarly, approximation-preserving reductions should be developed, and complete problems identified for these classes.
As a data quality rule, the CFD city = Beijing → AC = 10 introduced below catches the inconsistency in tuple t1: t1[AC] = 20 and t1[city] = Beijing violate the CFD.
There has been recent work on data accuracy [16]: given tuples t1 and t2 pertaining to the same entity e, we decide whether
t1 is more accurate than t2 in the absence of the true value of e.
It is also based on integrity constraints as data quality rules.
Data deduplication aims to identify tuples in one or more relations that refer to the same real-world entity. It is also known as
entity resolution, duplicate detection, record matching, record
linkage, merge-purge, database hardening, and object identification (for data with complex structures such as graphs).
Data consistency refers to the validity and integrity of data representing real-world entities. It aims to detect inconsistencies
or conflicts in the data. For instance, tuple t1 of Figure 1 is
inconsistent: its area code is 20 while its city is Beijing.
For example, consider tuples t1 , t2 and t3 in Figure 1. To answer query Q0 of Example 1, we want to know whether these
tuples refer to the same employee Mary. The answer is affirmative if, e.g., there exists another relation which indicates that
Mary Smith and Mary Webber have the same email account.
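As a toy illustration of such rule-based deduplication (our own sketch; the auxiliary relation and rule are hypothetical): if an auxiliary relation tells us that two tuples share the same email account, we identify them as the same entity, and take the transitive closure of this identification with a union-find structure.

    def resolve_entities(tuples, same_email_pairs):
        # tuples: dict id -> tuple, e.g., {"t1": {...}, "t2": {...}}
        # same_email_pairs: pairs of tuple ids known to share an email account
        parent = {t: t for t in tuples}

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]   # path compression
                x = parent[x]
            return x

        def union(x, y):
            parent[find(x)] = find(y)

        for a, b in same_email_pairs:
            union(a, b)
        # group tuple ids by the entity they resolve to
        clusters = {}
        for t in tuples:
            clusters.setdefault(find(t), []).append(t)
        return list(clusters.values())

    # e.g., another relation says t1 (Mary Smith) and t2 (Mary Webber)
    # use the same email account
    print(resolve_entities({"t1": {}, "t2": {}, "t3": {}}, [("t1", "t2")]))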
Inconsistencies are identified as violations of data dependencies (a.k.a. integrity constraints [1]). Errors in a single relation can be detected by intrarelation constraints such as conditional functional dependencies (CFDs) [34], while errors across different relations can be identified by interrelation constraints such as conditional inclusion dependencies (CINDs) [75]. An example CFD for the data of Figure 1 is: city = Beijing → AC = 10, asserting that for any tuple t, if t[city] = Beijing, then t[AC] must be 10.
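Detecting violations of such a constant CFD is straightforward for a single rule; here is a minimal Python sketch over the tuples of Figure 1 (our own illustration, not the SQL-based detection techniques referred to below):

    def cfd_violations(tuples, attr_lhs, val_lhs, attr_rhs, val_rhs):
        # the CFD: if t[attr_lhs] = val_lhs, then t[attr_rhs] must equal val_rhs
        return [t for t in tuples
                if t[attr_lhs] == val_lhs and t[attr_rhs] != val_rhs]

    # the CFD city = Beijing -> AC = 10 flags t1, whose area code is 20
    D0 = [
        {"id": "t1", "AC": "20", "city": "Beijing"},
        {"id": "t2", "AC": "10", "city": "Beijing"},
        {"id": "t3", "AC": "10", "city": "Beijing"},
    ]
    print(cfd_violations(D0, "city", "Beijing", "AC", "10"))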
For information completeness, given a dataset D, we want to decide whether D has complete information to answer an input query Q, among other things.
Data repairing. After the errors are detected, we want to automatically localize and fix them. We also need
to identify tuples that refer to the same entity, and for each entity, determine its latest and most accurate values from the data
in our database. When some data is missing, we need to decide
what data we should import and where to import it from, so that
we will have sufficient information for tasks at hand.
This highlights the need for data repairing [5]. Given a set Σ of dependencies and an instance D of a database schema R, data repairing is to find a candidate repair of D, i.e., another instance D' of R such that D' satisfies Σ and D' minimally differs from the original database D. The data repairing problem is, nevertheless, highly nontrivial: it is NP-complete even when a fixed set of traditional functional dependencies (FDs) or a fixed set of inclusion dependencies (INDs) is used as data quality rules [14]. In light of this, several heuristic algorithms have been developed to effectively repair data by employing FDs and INDs [14], CFDs [20, 96], or CFDs and MDs [46] as data quality rules.
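To convey the flavor of rule-based repairing (a deliberately naive sketch of our own, nothing like the heuristic algorithms cited above): when a tuple violates a constant CFD such as city = Beijing → AC = 10, one candidate repair changes the offending right-hand-side value, which differs from D by a single cell.

    import copy

    def repair_constant_cfd(tuples, attr_lhs, val_lhs, attr_rhs, val_rhs):
        # Naive repair: whenever t[attr_lhs] = val_lhs but t[attr_rhs] != val_rhs,
        # overwrite t[attr_rhs] with the value required by the rule.
        repaired = copy.deepcopy(tuples)
        changed_cells = 0
        for t in repaired:
            if t[attr_lhs] == val_lhs and t[attr_rhs] != val_rhs:
                t[attr_rhs] = val_rhs
                changed_cells += 1
        return repaired, changed_cells

    # repairing t1 of Figure 1 under city = Beijing -> AC = 10 changes one cell
    D0 = [{"id": "t1", "AC": "20", "city": "Beijing"},
          {"id": "t2", "AC": "10", "city": "Beijing"}]
    print(repair_constant_cfd(D0, "city", "Beijing", "AC", "10"))

Real repairing is far harder: rules interact, fixes may conflict, and one must search for a repair that minimally differs from D, which is what makes the problem NP-complete as noted above.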
Discovering data quality rules. To use dependencies as data quality rules, we first need to obtain the rules, either by manual design or by automatic discovery from data. More specifically, given a database D, the discovery problem is to find a minimal cover of all dependencies (e.g., CFDs, CINDs, MDs) that hold on D, i.e., a non-redundant set of dependencies that is logically equivalent to the set of all dependencies that hold on D. Several algorithms have been developed for discovering CFDs and MDs (e.g., [18, 35, 55]); a small checking primitive that underlies such discovery is sketched below.
Validating data quality rules. A given set Σ of dependencies, either automatically discovered or manually designed by domain experts, may be dirty itself. In light of this we have to identify consistent dependencies from Σ, i.e., those rules that make sense, to be used as data quality rules. Moreover, we need to remove redundancies from Σ via the implication analysis of the dependencies, to speed up the data cleaning process.
When the data is distributed, detecting errors may require shipping data from one site to another. In this setting, error detection with minimum data shipment or minimum response time becomes NP-complete [37], and the SQL-based techniques no longer work. For distributed data, effective batch algorithms [37] and incremental algorithms [44] have been developed for detecting errors, with certain performance guarantees. However, rule discovery and data repairing algorithms remain to be developed
for distributed data. These are highly challenging. For instance,
data repairing for centralized databases is already NP-complete
even when a fixed set of FDs is taken as data quality rules [14],
i.e., when only the size |D| of datasets is concerned (a.k.a. data
complexity [1]). When D is of PB size and is distributed, the
computational and communication costs are prohibitive.
From the example we can see that to deduce the true values of an entity, we need to combine several techniques: data
deduplication, data consistency and data currency, among other
things. This can be done in a uniform logical framework based
on data quality rules. There has been recent preliminary work
on the topic [39]. Nonetheless, there is much more to be done.
References
[1] S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, 1995.
[2] S. Acharya, P. B. Gibbons, V. Poosala, and S. Ramaswamy. Join synopses for approximate query answering. In SIGMOD, pages 275-286, 1999.
[3] F. N. Afrati and J. D. Ullman. Optimizing joins in a map-reduce environment. In EDBT, pages 99-110, 2010.
[4] S. Agarwal, B. Mozafari, A. Panda, H. Milner, S. Madden, and I. Stoica. BlinkDB: queries with bounded errors and bounded response times on very large data. In EuroSys, pages 29-42, 2013.
[5] M. Arenas, L. Bertossi, and J. Chomicki. Consistent query answers in inconsistent databases. In PODS, pages 68-79, 1999.
[6] M. Armbrust, K. Curtis, T. Kraska, A. Fox, M. J. Franklin, and D. A. Patterson. PIQL: Success-tolerant query processing in the cloud. PVLDB, 5(3):181-192, 2011.
[7] M. Armbrust, A. Fox, D. A. Patterson, N. Lanham, B. Trushkowsky, J. Trutna, and H. Oh. SCADS: Scale-independent storage for social computing applications. In CIDR, 2009.
[8] M. Armbrust, E. Liang, T. Kraska, A. Fox, M. J. Franklin, and D. Patterson. Generalized scale independence through incremental precomputation. In SIGMOD, pages 625-636, 2013.
Cleaning data with complex structures. Data quality techniques have been mostly studied for data with a regular structure and a schema, such as relational data. When it
comes to big data, however, data typically has an irregular structure and does not have a schema. For example, an entity may be
represented as a subgraph in a large graph, such as a person in a
social graph. In this context, all the central issues of data quality
have to be revisited. These are far more challenging than their
counterparts for relational data, and effective techniques are not
yet in place. Consider data deduplication, for instance. Given
two graphs (without a schema), we want to determine whether
they represent the same object. To do this, we need to extend
data quality rules from relations to graphs.