
Querying Big Data: Bridging Theory and Practice

Wenfei Fan a,b, Jinpeng Huai b

a School of Informatics, University of Edinburgh, 10 Crichton Street, Edinburgh, EH8 9AB, United Kingdom
b International Research Center on Big Data, Beihang University, No.37 XueYuan Road, 100083, Beijing, China

Abstract
Big data introduces challenges to query answering, from theory to practice. A number of questions arise. What queries are
tractable on big data? How can we make big data small so that it is feasible to find exact query answers? When exact answers
are beyond reach in practice, what approximation theory can help us strike a balance between the quality of approximate query
answers and the costs of computing such answers? To get sensible query answers in big data, what else do we need to do in
addition to coping with the size of the data? This position paper aims to provide an overview of recent advances in the study of
querying big data. We propose approaches to tackling these challenging issues, and identify open problems for future research.
Keywords: Big data, query answering, tractability, distributed algorithms, incremental computation, approximation, data quality.

1. Introduction
Big data is a term that is almost as popular as the internet was 20 years ago. It refers to a collection of data sets
so large and complex that it becomes difficult to process using traditional database management tools or data processing
applications [92]. More specifically, big data is often characterized by four Vs: Volume for the scale of the data, Velocity for its streaming or dynamic nature, Variety for its different
forms (heterogeneity), and Veracity for the uncertainty (poor
quality) of the data [64]. Such data comes from social networks (e.g., Facebook, Twitter, Sina Weibo), e-commerce systems (e.g., Amazon, Taobao), finance (e.g., stock transactions),
sensor networks, software logs, e-government and scientific research (e.g., environmental research), just to name a few, where
data is easily of PetaByte (PB, 10^15 bytes) or ExaByte (EB, 10^18 bytes) size. The chances are that big data will have as big an impact on our daily lives as the internet has.
New challenges. As big data researchers, we do not confine ourselves to this general characterization of big data. We are more interested in what specific technical problems or research issues
big data introduces to query answering. Given a dataset D and
a query Q, query answering is to find the answers Q(D) to Q
in D. Here Q can be an SQL query on relational data, a keyword query to search documents, or a personalized social search
query on social networks (e.g., Graph Search of Facebook [29]).
Example 1. A fraction D0 of an employee dataset of a company is shown in Figure 1. Each tuple in D0 specifies the first name (FN), last name (LN), salary and marital status of an employee, as well as the area code (AC) and city of her office. A
query Q0 is to find distinct employees whose first name is Mary.

      FN      LN      AC    city     salary   status
t1 :  Mary    Smith   20    Beijing  50k      single
t2 :  Mary    Webber  10    Beijing  50k      married
t3 :  Mary    Webber  10    Beijing  80k      married
s1 :  Bob     Luth    212   NYC      80k      married
s2 :  Robert  Luth    212   NYC      55k      married

Figure 1: An employee dataset D0

Such a query can be expressed in, e.g., relational algebra, written as σ_{FN = "Mary"}(R0) using the selection operator σ [1], where R0 is the relation schema of D0. To answer the query Q0 in D0, we need to find all tuples in D0 that satisfy the selection condition FN = "Mary", i.e., tuples t1, t2 and t3.
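For concreteness, here is a minimal sketch (in Python, with the tuples of Figure 1 hard-coded purely for illustration) of how the selection σ_{FN = "Mary"} can be evaluated by a straightforward scan:

```python
# A toy instance of D0, following Figure 1 (illustrative only).
D0 = [
    {"FN": "Mary",   "LN": "Smith",  "AC": "20",  "city": "Beijing", "salary": "50k", "status": "single"},
    {"FN": "Mary",   "LN": "Webber", "AC": "10",  "city": "Beijing", "salary": "50k", "status": "married"},
    {"FN": "Mary",   "LN": "Webber", "AC": "10",  "city": "Beijing", "salary": "80k", "status": "married"},
    {"FN": "Bob",    "LN": "Luth",   "AC": "212", "city": "NYC",     "salary": "80k", "status": "married"},
    {"FN": "Robert", "LN": "Luth",   "AC": "212", "city": "NYC",     "salary": "55k", "status": "married"},
]

def select(relation, attribute, value):
    """Evaluate the selection sigma_{attribute = value}(relation) by a linear scan."""
    return [t for t in relation if t[attribute] == value]

print(select(D0, "FN", "Mary"))   # the tuples t1, t2 and t3
```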

In the context of big data, query answering becomes far more
challenging than what we have seen in Example 1. The new
complications include but are not limited to the following.
Data. In contrast to a single traditional database D0 , there are
typically multiple data sources with information relevant to our
queries. For instance, a recent study shows that many domains
have tens of thousands of Web sources [22], e.g., restaurants,
hotels, schools. Moreover, these data sources often have a large
volume of data (e.g., of PB size) and are frequently updated.
They have different formats and may not come with a schema,
as opposed to structured relational data. Furthermore, many
data sources are unreliable: their data is typically dirty.
Query. Queries posed on big data are no longer limited to our
familiar SQL queries. They are often for document search, social search or even for data analysis such as data mining. Moreover, their semantics also differs from traditional queries. On
one hand, it can be more flexible: one may want approximate
answers instead of exact answers Q(D). On the other hand,
one could ask query answering to be ontology-mediated by coupling datasets with a knowledge base [12], or personalized and

ter how large the underlying dataset D is. For instance, when
Q is a Boolean conjunctive query (a.k.a. SPC query [1]), we
need at most ||Q|| tuples from D to answer Q, independent of
the size of D, where ||Q|| is the number of tuples in the tableau
representation of Q. This is also the case for many personalized social search queries. Intuitively, if a core DQ of D for
answering Q has a bounded size, then Q is scale independent
in D [36], i.e., we can efficiently compute Q(D) no matter how
big D is. This suggests that we study how to determine whether
a query is scale independent in a dataset.

context-aware [86] such that the same query gets different answers when issued by different people in different locations.
These tell us that query answering in big data is a departure from our familiar terrain of traditional database queries. It
raises a number of questions. Does big data give rise to any new
fundamental problems? In other words, do we need new theory
for querying big data? Do we need to develop new methodology for query processing in the context of big data? What practical techniques could we use to cope with the sheer volume of
big data? In addition to the scalability of query answering algorithms, what else do we have to pursue in order to find sensible
or even correct query answers in big data?

In addition, we develop several practical techniques for making big data small. These include (a) distributed query
processing by partial evaluation [47], with provable performance guarantees on both response time and network traffic;
(b) query-preserving data compression [45]; (c) view-based
query answering [50]; and (d) bounded incremental computation [49, 82]. All these techniques allow us to compute Q(D)
with a cost that is not a function of the size of D, and have
proven effective in querying social networks. The list is not exhaustive: there are many other techniques for making big data
small and hence, making queries feasible on big data.

Querying big data. This paper presents an overview of recent


advances in the study of these problems. It is a progress report
of the International Research Center on Big Data at Beihang
University [10], which was established in September 2012, and
has been working on querying big data since then. We report
how we tackle the problems mentioned above.
BD-tractability. The first question we need to answer is what
queries are tractable on big data. Given a query Q and a big
dataset D, we want to know whether we can compute Q(D)
within our available resources such as time and space. As found
in most textbooks (e.g., [1, 81]), a class of queries is traditionally considered tractable if there exists an algorithm for answering its queries in time bounded by a polynomial in the size of
the input (PTIME), i.e., a database and a query. In other words,
a class of queries is feasible from a theoretical perspective if its
worst-case time complexity is PTIME, while a class is considered difficult to solve when it is NP-hard. This notion of time complexity dates back to 1965 [60] and is almost 50 years old.
When it comes to big data, however, PTIME queries may
no longer be feasible. For instance, consider the query Q0 and
dataset D0 given in Example 1. To compute Q0 (D0 ) in the
absence of any indices, one may need to scan D0 . Assuming
the fastest Solid State Drives (SSD) with disk scanning speed
of 6GB/s [85], a linear scan of D0 takes 166,666 seconds when
D0 consists of 1PB of data; that is, 2,777 minutes, 46 hours,
or 1.9 days! When D0 has 1EB of data, we have to wait 5.28
years for a linear scan of D0 . That is, even linear-time (O(n))
queries become infeasible in the context of big data.
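The back-of-the-envelope figures above can be reproduced directly; a small sketch, using the 6 GB/s scan rate assumed in the text:

```python
SCAN_RATE = 6 * 10**9            # bytes per second: the SSD scan speed assumed above
PB, EB = 10**15, 10**18          # 1 PB and 1 EB in bytes

for label, size in [("1PB", PB), ("1EB", EB)]:
    seconds = size / SCAN_RATE
    print(f"{label}: {seconds:,.0f} s = {seconds / 60:,.0f} min "
          f"= {seconds / 3600:,.1f} h = {seconds / (3600 * 24 * 365):,.2f} years")
# 1PB: about 166,667 s, i.e., roughly 1.9 days; 1EB: about 5.28 years
```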
This suggests that we revise the classical computational complexity theory for querying big data. To this end, we propose a
notion of BD-tractable queries [38], to help us determine what
queries are tractable or feasible on big data.

Query-driven and data-driven approximation. Some queries


are neither BD-tractable nor can be made BD-tractable. An
example is graph pattern matching by subgraph isomorphism.
Here query Q is a graph pattern, dataset D is a graph, and the
answer Q(D) is the set of all subgraphs of D that are isomorphic to Q. Such queries are expensive: it is NP-complete even
to decide whether there exists a subgraph of D that is isomorphic to Q! It is beyond reach in the context of big D to compute
exact answers Q(D). In light of this, algorithms for processing
such queries on big data are necessarily inexact. We may have
to settle for heuristics: quick-and-dirty algorithms that return approximate answers that are not necessarily optimal [81].
This highlights the need for studying the next question: how
can we develop approximation algorithms, i.e., heuristics which
find answers that are guaranteed to be not far from the exact
query answers? We propose two types of approximation.
(1) Query-driven approximation. For certain queries we can
relax their semantics and reduce the complexity of query processing. One example is the class of graph pattern queries mentioned above, for social network analysis. Instead of adopting subgraph isomorphism for graph pattern matching, we can
use (revisions of) graph simulation [41, 42, 74]. This reduces
the complexity of graph pattern matching from intractability
by subgraph isomorphism to quadratic-time or cubic-time by
(revised) graph simulation! Better still, the revised notions of
graph simulation allow us to catch more sensible matches in
social data analysis than subgraph isomorphism can find.

Making queries BD-tractable. It is not surprising that many


query classes are not BD-tractable. The next question naturally
asks whether we can make these query classes BD-tractable?
We approach this by studying both its fundamental problems
and practical techniques, by making big data small.
To understand what it takes to compute answers Q(D) of
a query Q in a dataset D, we want to identify a core of D
for answering Q, i.e., a minimum subset DQ of D such that
Q(D) = Q(DQ ). Indeed, it often suffices to fetch a small or
even a bounded subset DQ of D for computing Q(D), no mat-

(2) Data-driven approximation. In some applications we may


not be able to relax query semantics. To this end, we propose a notion of resource-bounded approximation in this paper. In contrast to traditional approximation algorithms that directly operate on a given big dataset D, we first reduce D to
small data DQ with a lower resolution α ∈ (0, 1], such that |DQ| ≤ α·|D|. We then compute Q(DQ) as approximate query

answers to Q, such that Q(DQ) is within a performance ratio η of the exact answer Q(D). We explore the connection between the resolution α and the quality bound η, to strike a balance between the computation cost and the quality of the approximate answers. Our preliminary study [52] has shown that for personalized social search queries, the performance ratio remains 100% even when the resolution α is as small as 0.0015% (15 × 10^-6). That is, we can reduce D of 1PB to DQ of 15GB, while still retaining exact answers for such queries!
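To illustrate the idea only (this is not the algorithm of [52]), the sketch below reduces a big friendship graph to a small, resolution-α subset DQ around the designated person of a personalized query, and then evaluates the query on DQ alone; the BFS-based reduction and all names are assumptions made for the example.

```python
from collections import deque

def resource_bounded_reduce(adj, start, alpha, total_size):
    """Dynamically reduce the data: keep at most alpha * total_size nodes
    reachable from `start` (a simple BFS-based reduction, for illustration)."""
    budget = max(1, int(alpha * total_size))
    kept, queue = {start}, deque([start])
    while queue and len(kept) < budget:
        u = queue.popleft()
        for v in adj.get(u, []):
            if v not in kept and len(kept) < budget:
                kept.add(v)
                queue.append(v)
    return kept

def approx_friends_of_friends(adj, person, alpha, total_size):
    """Answer a personalized query on the reduced dataset DQ instead of D."""
    dq = resource_bounded_reduce(adj, person, alpha, total_size)
    return {w for v in adj.get(person, []) if v in dq
              for w in adj.get(v, []) if w in dq and w != person}

# Example: a tiny graph standing in for a big social network.
adj = {"p0": ["a", "b"], "a": ["c"], "b": ["c", "d"], "c": [], "d": []}
print(approx_friends_of_friends(adj, "p0", alpha=0.8, total_size=len(adj)))
```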

The study of querying big data is still in its infancy, and it


has raised as many questions as it has answered. In light of
this, we also identify open research issues in this paper, and
propose approaches to tackling them. We hope that the paper
will incite interest in the study of querying big data, and we
invite interested colleagues to join forces with us in the study.

Big data = quantity + quality. To compute high-quality query


answers from big data, it is often insufficient just to develop
scalable algorithms to cope with large volume of the data. To
illustrate this, let us consider the following example.

This section studies the following problem: given a class Q


of queries that we need to use, we want to determine whether Q
is tractable in big data, i.e., it is feasible to answer the queries of
Q in big data within our available resources. As we have seen
in Section 1, polynomial time can no longer provide a characterization for Q to be tractable in big data. This suggests that
we revise the traditional notion of tractability, and define BD-tractability, i.e., tractability for queries on big data.

2. Tractability Revised for Querying Big Data

Example 2. Recall query Q0 and dataset D0 from Example 1.


Suppose that we have efficient techniques in place to compute
Q0 (D0 ) for big D0 . As remarked earlier, Q0 (D0 ) consists of
three tuples t1 , t2 and t3 . The question is: can we trust Q0 (D0 )
to be the correct answer to what the user wants to find?
Unfortunately, there are at least three reasons that discredit
our trust in Q0 (D0 ). (1) In tuple t1 , attribute t1 [AC] is 20 and
t1 [city] is Beijing, while the area code of Beijing is 10. In light
of this, tuple t1 is inconsistent and hence, its quality is in
question. (2) The chances are that all three tuples t1 , t2 and t3
refer to the same person; in other words, they do not represent
distinct employees. (3) Furthermore, the dataset D0 may be
incomplete: for some employees whose first name is also Mary,
their records are not included in D0 . In light of these, we do
not know whether the answer Q0 (D0 ) is correct or not!


Below we present a notion of BD-tractable queries. We encourage the interested reader to consult [38] for details.
Preliminaries. We start with a review of two well-studied complexity classes (see, e.g., [58, 67] for details).
The complexity class P consists of all decision problems that can be solved by a deterministic Turing machine in polynomial time (PTIME), i.e., in n^O(1) time, where n is the size of the input (dataset D and query Q in our case).
The parallel complexity class NC, known as Nick's Class, consists of all decision problems that can be solved in O(log^O(1) n) time on a PRAM (parallel random access machine) with n^O(1) processors.

From the example we can see that when the datasets are dirty,
we cannot trust the answers to our queries in those datasets. In
other words, no matter how big a dataset we can handle and
how fast our query processing algorithms are, the query answers computed may not be correct and hence may be useless!
Unfortunately, real-life data is often dirty [33], and the scale of
data quality problems is far worse in the context of big data,
since real-life data sources are often unreliable. Therefore, the
study of the quality of big data is as important as techniques for
coping with its quantity; that is, big data = quantity + quality!
This motivates us to study the quality of big data. We consider five central issues of data quality: data consistency [34],
data accuracy [16], information completeness [32], data currency [40] and entity resolution [31], from theory to practice.
We study how to repair dirty data [20, 43, 46] and how to deduce true values of an entity [39], among other things, emphasizing new challenges introduced by big data.

In this paper we focus on query classes rather than decision


problems. We use P to denote the set of all PTIME query
classes. We say that a query class Q is in NC if all of its queries
can be answered in parallel polylog-time, i.e., polynomial time
in the logarithm of the input using a PRAM with polynomially
many processors. Such a query class is highly parallel feasible, i.e., its queries can be efficiently answered on a parallel
computer [58]. It is also known that a large class of NC algorithms can be implemented in the MapReduce framework [69],
such that if an NC algorithm takes t time, then its corresponding MapReduce counterpart takes O(t) rounds. We use NC to
denote the set of all such parallel polylog-time query classes. It
should be remarked that there have been revisions of the PRAM
model by requiring log n processors instead of n^O(1) [25].
BD-tractability. To make query answering feasible in big data,
we adopt two ideas: (1) using parallel machines, and (2) separating offline and online processes. The second idea suggests
that we preprocess a dataset D by, e.g., building indices or compressing the data, which yields a dataset D′, such that all queries in Q on D can subsequently be processed on D′ online efficiently. When the data is static or when D′ can be incrementally
maintained efficiently, the preprocessing step can be considered
as an offline process with a one-time cost. Preprocessing has
been a common practice of database people for decades.

Organization. The remainder of the paper is organized as follows. We start with BD-tractability in Section 2. We study scale
independence and present several practical techniques for making queries BD-tractable in Section 3. When BD-tractable algorithms for computing exact query answers are beyond reach in
practice, we study approximate query answering in Section 4,
by proposing query-driven approximation and data-driven approximation. We study the other side of big data, namely, data
quality, in Section 5. Finally, Section 6 concludes the paper.

be answered in O(log |D0|) time by using the indices in Π(D0).


In fact, the class of all relational algebra queries extended
with transitive closure is also in BDT0 over ordered relational
datasets, since those queries are in NC in this setting [88].

Example 3. Recall query Q0 and dataset D0 from Example 1.


Extending Q0, let us consider a class Q0 of Boolean selection queries. A query Q in Q0 is to find whether there exists a tuple t ∈ D0 such that t[A] = c, where A is an attribute of D0 and c is a constant. A naive evaluation of Q would require a linear scan of D0. To efficiently answer queries of Q0 in D0, we can first build B+-trees on the values of the attributes of D0, as a one-time preprocessing step offline. Then we can evaluate all queries in Q0 on D0 in O(log |D0|) time using the indices. That is, we no longer need to scan D0 when processing each query in Q0. When D0 consists of 1PB of data, we can get the results in 5 seconds with the indices rather than 1.9 days.
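A minimal sketch of this offline/online separation for the class Q0, using an in-memory hash index in place of B+-trees (an assumption made for brevity; a sorted index with binary search would match the O(log |D0|) bound more literally):

```python
from collections import defaultdict

def preprocess(relation, attributes):
    """Offline, one-time step: build one index per attribute, mapping value -> row ids."""
    indices = {A: defaultdict(list) for A in attributes}
    for rid, t in enumerate(relation):
        for A in attributes:
            indices[A][t[A]].append(rid)
    return indices

def boolean_select(indices, attribute, value):
    """Online step: is there a tuple t with t[A] = c?  No scan of the relation."""
    return bool(indices[attribute].get(value))

# Usage, assuming D0 is loaded as a list of dicts as in Example 1:
# idx = preprocess(D0, ["FN", "LN", "AC", "city", "salary", "status"])
# boolean_select(idx, "FN", "Mary")   # True
```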


Making queries BD-tractable. Some query classes Q are


not BD-tractable, but can be transformed to a BD-tractable
query class by means of re-factorizations. A re-factorization repartitions the data and query parts for Q and identifies a dataset
for preprocessing, such that after the preprocessing, its queries
can be subsequently answered in parallel polylog-time.
Complexity class BDT. More specifically, we say that a class Q of queries can be made BD-tractable if there exist three NC-computable functions π1(·), π2(·) and ρ(·,·) such that for all ⟨D, Q⟩ in the language S of pairs for Q,

Based on these two ideas, below we propose a revision of the


traditional notion of tractable query classes.
To be consistent with the complexity classes that are traditionally studied for decision problems [58, 67], we consider
Boolean query classes Q, and represent Q as a language S of
pairs ⟨D, Q⟩, where Q is a query in Q, D is a database on which Q is defined, and Q(D) is true. In other words, S can be considered as a binary relation such that ⟨D, Q⟩ ∈ S if and only if
Q(D) is true. We refer to S as the language for Q.

D′ = π1(D, Q), Q′ = π2(D, Q), ⟨D, Q⟩ = ρ(D′, Q′), and
the query class Q′ = {Q′ | Q′ = π2(D, Q), ⟨D, Q⟩ ∈ S} is BD-tractable.
Intuitively, π1(·) and π2(·) re-partition x = ⟨D, Q⟩ into a data part D′ = π1(x) and a query part Q′ = π2(x), and ρ is an inverse function that restores the original instance x from π1(x) and π2(x). The data part D′ is picked from x and will be preprocessed, such that after the preprocessing step, all the queries Q′ ∈ Q′ can then be answered in parallel polylog-time.
We use BDT to denote the set of all query classes that can be made BD-tractable. Obviously, BDT0 is a subset of BDT, when D = π1(D, Q), Q = π2(D, Q), and ρ is the identity function. As will be seen next, BDT0 is a proper subset of BDT unless P = NC, i.e., there is a query class that is in BDT but not in BDT0.

We say that a language S of pairs is in complexity class CQ if it is in CQ to decide whether a pair ⟨D, Q⟩ ∈ S, i.e., whether Q(D) is true. Here CQ may be the sequential complexity class P or the parallel complexity class NC, among other things.
Complexity class BDT0. We say that a class Q of queries is BD-tractable if there exist a PTIME-computable preprocessing function Π on datasets and a language S′ of pairs such that for all queries Q ∈ Q and all datasets D,
⟨D, Q⟩ is in the language S of pairs for Q if and only if ⟨Π(D), Q⟩ ∈ S′, and

Example 5. Consider Breadth-Depth Search (BDS) [58]:


Input: An undirected graph G = (V, E) with a numbering
on the nodes, and a pair (u, v) of nodes in V .

S′ is in NC, i.e., the language of pairs ⟨Π(D), Q⟩ is in NC.


We denote by BDT0 the set of all BD-tractable query classes.

Question: Is u visited before v in the breadth-depth search


of G induced by the vertex numbering?

Intuitively, the function Π(·) preprocesses D and generates another structure D′ = Π(D) offline, in PTIME. After this, for all queries Q ∈ Q that are defined on D, Q(D) can be answered by evaluating Q(D′) online in NC, i.e., in parallel polylog-time.

A breadth-depth search starts at a node s and visits all its children, pushing them onto a stack in the reverse order induced by
the vertex numbering as the search proceeds. After all of s's
children are visited, the search continues with the node on the
top of the stack, which plays the role of s.
In the problem statement of BDS given above, the entire input, i.e., x = (G, (u, v)), is treated as a query, while its data part
is empty. In this setting, there is nothing to be preprocessed.
Moreover, it is known that BDS is P-complete (cf. [58]), i.e., it
is the hardest problem in the complexity class P. Unless P =
NC, such a query cannot be processed in parallel polylog-time.
In other words, this class of BDS queries is not in BDT0 unless
P = NC. It is also known that the question whether P = NC is
as hard as our familiar open question whether P = NP.
Nonetheless, there exists a re-factorization (π1, π2, ρ) of its instances x = (G, (u, v)) that identifies G as the data part and (u, v) as the query part. More specifically, π1(x) = G, π2(x) = (u, v), and ρ maps π1(x) and π2(x) back to x. Given this,

Observe the following. (a) As shown in Example 3, parallel


polylog-time is feasible on big data. Moreover, NC is robust
and well-understood. It is one of the few parallel complexity
classes whose connections with classical sequential complexity
classes have been well studied (see, e.g., [58]). (b) We consider
PTIME preprocessing feasible since it is a one-time price and
is performed offline. Note that the preprocessing step is also
expected to be conducted using parallel machines, possibly by
allocating more resources (e.g., computing nodes) to it than to
online query answering. Moreover, by requiring that Π(·) is in PTIME, the size of Π(D) is bounded by a polynomial.
Example 4. As we have seen in Example 3, the class Q0 of
Boolean selection queries is in BDT0. Indeed, the function Π(·) preprocesses a dataset D0 by building B+-trees on the attributes of D0 in PTIME. After this, all queries in Q0 posed on D0 can

class Q is to decide, given a query Q ∈ Q, a dataset D and an element e, whether e ∈ Q(D), i.e., whether e is in the answer to Q in D.

we define the preprocessing Π(·) as the function that performs a breadth-depth search on G based on the ordering on the vertices, and returns a list M consisting of all the nodes in V in the same order as they are visited during the search. Then Π(G) is clearly computable in PTIME in |G|. Let S′ be the language of pairs ⟨M, (u, v)⟩ such that u appears before v in M. Obviously, one can decide whether ⟨M, (u, v)⟩ ∈ S′ by binary search on M, in O(log |M|) time. Hence BDS is in BDT. In other words, while BDS is not BD-tractable, it can be made BD-tractable by means of a re-factorization. In light of this, BDS provides a witness that separates BDT and BDT0, unless P = NC.
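A small sketch of this re-factorization: the offline function Π performs the breadth-depth search once and records the visiting order, and each online query (u, v) is then answered from the recorded positions without touching G again. The traversal follows the description above; storing positions in a dictionary (constant-time lookup instead of binary search on M) and the data layout are assumptions made for the example.

```python
def bds_order(adj, start, num):
    """Breadth-depth search from `start`: visit all unvisited children of the current
    node, push them on a stack, then continue from the top of the stack."""
    visited, order, stack = {start}, [start], [start]
    while stack:
        s = stack.pop()
        children = sorted((c for c in adj.get(s, []) if c not in visited), key=num.get)
        for c in children:
            visited.add(c)
            order.append(c)
        stack.extend(reversed(children))   # the smallest-numbered child is expanded next
    return order                           # only nodes reachable from `start` appear

def preprocess(adj, start, num):
    """Pi(G): record each node's position in the BDS order (an offline, PTIME step)."""
    return {v: i for i, v in enumerate(bds_order(adj, start, num))}

def visited_before(pos, u, v):
    """Online query for a pair (u, v), answered from Pi(G) alone."""
    return pos[u] < pos[v]
```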


Open issues. There has been a host of recent work on revising the traditional complexity theory to characterize data-intensive computation on big data. The revisions are defined in
terms of computational costs [38], communication (coordination) rounds [61, 71], or MapReduce steps [69] and data shipments [3] in the MapReduce framework [23]. Our notions of
BD-tractability focus on computational costs [38]. The study is
still preliminary, and a number of questions remain open.
(1) The first question concerns what complexity class precisely
characterizes online query processing that is feasible on big
data. As a starting point we adopt NC because (a) NC is considered highly parallel feasible [58]; (b) parallel polylog-time is
feasible on big data; and (c) many NC algorithms can be implemented in the MapReduce framework [69], which is being used
in cloud computing and data centers for processing big data.
However, NC is defined in the PRAM model, which may not be
accurate for real-life parallel frameworks such as MapReduce.

Fundamental issues. There are several important questions in


connection with BD-tractability. What reductions can we use to
transform one query class in BDT to another? Does there exist
a natural class Q of queries that is complete for BDT, i.e., Q is
a class of the hardest queries in BDT? How large is BDT?
In other words, is it a new complexity class or the same as P
or NC? The same questions also arise for BDT0 . In fact, these
are the standard questions one would have to answer for any
complexity class, including our familiar P and NP.
These questions have been studied for BDT and BDT0 [38].

These call for a full treatment of parallel computation models


that are more practical than PRAM for characterizing available
resources in the real world. Such models should take into account both computational complexity and communication costs.
Upon the availability of such models, the class BDT0 of BDtractable queries should then be revised accordingly.

A form of NC-reductions, denoted ≤^NC_fa, has been defined for BDT, which is transitive (i.e., if Q1 ≤^NC_fa Q2 and Q2 ≤^NC_fa Q3, then Q1 ≤^NC_fa Q3) and compatible with BDT (i.e., if Q1 ≤^NC_fa Q2 and Q2 is in BDT, then so is Q1). Similarly, NC-reductions have been defined for BDT0 with these properties. In contrast to our familiar PTIME-reductions for NP problems (see, e.g., [81]), these reductions require a pair of NC functions, i.e., both are in parallel polylog-time.

(2) The second question concerns the complexity of preprocessing. Let us use PQ[CP , CQ ] to denote the set of all query classes
that can be answered by preprocessing the data sets in the complexity class CP and subsequently answering the queries in CQ .
Then BDT0 can be represented by PQ[P, NC]. One may consider other complexity classes CP instead of P. For instance,
one may consider PQ[NC, NC] by requiring the preprocessing
step to be conducted more efficiently; this is not very interesting
since PQ[NC, NC] coincides with NC. On the other hand, one
may want to consider CP beyond P, e.g., NP and PSPACE (i.e.,
PQ[NP, NC] and PQ[PSPACE, NC]). This is another debatable
issue that demands further study. No matter what PQ[CP , CQ ]
we use, one has to strike a balance between its expressive power
and computational cost in the context of big data.

There exists a complete query class Qm for BDT under ≤^NC_fa reductions, i.e., Qm is in BDT and moreover, for all query classes Q in BDT, Q ≤^NC_fa Qm. However, the question whether there exists a complete query class for BDT0 is as hard as the open question whether P = NC.
NC ⊆ BDT = P. That is, all PTIME query classes can be made BD-tractable via proper re-factorizations, or in other words, by transforming them to a query class in BDT via ≤^NC_fa reductions. In contrast, unless P = NC, BDT0 ⊊ P, i.e., BDT0 is indeed a proper subset of P, and hence, not all PTIME queries are BD-tractable.

(3) BD-tractability has only been studied for Boolean queries


and decision problems, as people usually do in complexity theory. Nevertheless, BD-tractability for general queries, as well
as for search and function problems, remains to be studied.

These results are not only of theoretical interest, but also provide guidance for us to answer queries in big data. For instance,
given a query class Q, we can conclude that it can be made BD-tractable if we can find a ≤^NC_fa reduction to a complete query class Qm of BDT. If so, we are guaranteed an effective algorithm for answering queries of Q in big data. Indeed, such an algorithm can be developed by simply composing the NC reduction and an NC algorithm for processing Qm queries; then the algorithm remains in parallel polylog-time.
One may ask what query classes cannot be made BD-tractable. The results above also tell us the following: unless
P = NP, all query classes for which the membership problem is
NP-hard are not in BDT. The membership problem for a query

(4) There are a number of open issues in connection with query


evaluation with preprocessing. Given a query class, how can we
effectively identify a re-factorization that appropriately picks
the right dataset to be preprocessed? What preprocessing strategies should we use? If a query class cannot be made BD-tractable, can we still answer its queries in big data? We will address some of these questions in the next few sections.
(5) The last question concerns the existence of a complete
query class for BDT0 . However, this is as hard as the problem
whether P = NC, which is as hard as whether P = NP.

3. Making Big Data Small

Example 6 [36] Some real-life queries are actually scale independent. For example, below are (slightly modified) personalized search queries taken from Graph Search of Facebook [29].

Following up the notion of BD-tractability presented in the


last section, we next investigate how we can make queries BD-tractable. There are many ways to do this, such as building up indices as we have seen in Example 3. In this section we focus on a particular approach, by making big data small. Suppose that we need to answer a class Q of queries in a big dataset D. We propose to reduce D to a dataset D′ (or a number of fragments D′) of a manageable size, such that (1) for all queries Q ∈ Q, Q(D) = Q(D′), and (2) we can efficiently answer Q in D′ within our available resources. In other words, as a preprocessing step, we reduce big D to small D′ such that we can still compute exact answers Q(D) by accessing only the small dataset D′ instead of operating on the original big D directly.

(1) Query Q1 is to find all NYC friends of a person p0 , from


a dataset D1 . Here D1 consists of two relations specified by
person(id, name, city) and friend(id1 , id2 ), which record the basic information of people (with a key id) and their friend relationships, respectively. Query Q1 can be written as follows:

Q1(name) = ∃ id ( friend(p0, id) ∧ person(id, name, NYC) ).

Observe the following. (1) In personalized social searches we


evaluate queries with a specified person, e.g., p0 in Q1 . (2)
Dataset D1 is often big in real life. For instance, Facebook has
more than 1 billion users with 140 billion friend links [28]. A
naive computation of the answer to Q1 , even if p0 is known,
may fetch the entire D1 , and is cost prohibitive.
Nonetheless, we can compute Q1 (D1 ) by accessing only a
small subset DQ1 of D1 . Indeed, Facebook has a limit of 5000
friends per user (cf. [7]), and id is a key of person. Thus by
using indices on id attributes, we can identify DQ1 , which consists of a subset Df of friend including all friends of p0 , and a
set Dp of person tuples t such that t[id] = t′[id2] for some tuple t′ in Df. Then Q1(DQ1) = Q1(D1). Moreover, DQ1 contains at most 10,000 tuples of D1, and is much smaller than D1. Thus Q1 is scale independent in D1 w.r.t. M ≥ 10,000. In fact, one
can verify that Q1 is scale independent in all instances of the
schemas person and friend that satisfy the two constraints.
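A sketch of such a bounded query plan for Q1, assuming two indices of the kind discussed: a friend index on id1 (at most 5000 entries per person, by the constraint above) and a key index on person(id). Only the core DQ1 is touched, never the full D1.

```python
def q1_nyc_friends(p0, friend_index, person_index):
    """Fetch at most 5000 friend tuples of p0 and then one person tuple per friend;
    the answer is computed from this bounded core DQ1, not from the whole D1."""
    names = []
    for fid in friend_index.get(p0, []):     # at most 5000 ids, by the friend limit
        person = person_index.get(fid)       # id is a key: at most one tuple
        if person is not None and person["city"] == "NYC":
            names.append(person["name"])
    return names

# friend_index: dict id -> list of friend ids; person_index: dict id -> person tuple.
# Both would be built offline or provided by the store's existing indices.
```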

The idea is simple. But to implement it, we need to settle several fundamental questions and develop practical techniques.
Below we first study questions concerning whether it is possible at all to find a small dataset D such that Q(D) = Q(D ).
We then present several practical techniques to make big data
small, which have been evaluated by using social network analysis as a testbed, and have proven effective in the application.
3.1. Scale Independence
We start with fundamental problems associated with the approach to making big data small. We first study the existence
of a small subset D′ of D such that we can answer Q in D by accessing only the data in D′. We then present effective methods for identifying such a D′. We invite the interested reader to
consult [36] for a detailed report on this subject.

(2) Consider another query Q2 , which is to find from a dataset


D2 all A-rated NYC restaurants that were visited by NYC friends
of p0 in 2013. Here D2 consists of four relations, specified
by a relational schema R2 including person and friend as
above, as well as restr(rid, name, city, rating) (with rid as a key)
and visit(id, rid, yy, mm, dd) (indicating that person id visited
restaurant rid on a given date). Then Q2 can be expressed as:

To simplify the discussion we consider relational queries. Let


R be a relational schema (i.e., R = (R1 , . . . , Rn ), where Ri is a
relation schema [1]), D a database instance of R, Q a query in
query class Q such as relational algebra or conjunctive queries,
and M a non-negative integer. Let |D| denote the size of D,
measured as the total number of tuples in relations of D.

Q2(rn) = ∃ id, rid, pn, mm, dd ( friend(p0, id) ∧ visit(id, rid, 2013, mm, dd) ∧ person(id, pn, NYC) ∧ restr(rid, rn, NYC, A) ).

The definition. We say that Q is scale independent in D w.r.t.


M if there exists a subset DQ ⊆ D such that
|DQ| ≤ M, and

Note that query Q2 is also scale-independent. Indeed, (a) a


year has at most 365 days; and (b) it is safe to assume that on
a given day, each person id dines out at most once. Putting
these together with the constraints on friend and person (i.e.,
a person can have at most 5000 friends at Facebook, and id
is a key of person), one can compute Q2 (D2 ) by accessing a
bounded number of tuples, instead of scanning the entire D2 .
Indeed, Q2 is scale independent for all instances of schema R2
under these constraints.


Q(DQ ) = Q(D).
That is, to answer Q in D, we need only to fetch at most M
tuples from D, regardless of how big D is. We refer to DQ as
a core for answering Q in D. Note that DQ may not be unique.
As will be seen shortly, we want to find a minimum core.
One step further, we say that Q is scale independent for R
w.r.t. M if for all instances D of R, Q is scale independent in
D w.r.t. M , i.e., one can always find a core DQ with at most M
tuples for answering Q in D.

One can show that a query Q is scale independent for any


schema R over which Q is defined when Q is either
a Boolean conjunctive query, if ||Q|| ≤ M, or
a top-k conjunctive query for a constant k and a scoring function f, if k·||Q|| ≤ M,

The term scale independence is borrowed from [6, 7, 8].


The need for studying scale independence is evident in practice.
It allows us to answer Q in big D by accessing a small dataset
within our available resources. Moreover, if Q is scale independent for R, we can answer Q without performance degradation
when D grows, and hence, make Q scalable with |D|.

where ||Q|| is the number of tuple templates in the tableau presentation of the conjunctive query Q [1]. Here Q is Boolean
if for any instance D of R, Q(D) returns true if Q(D) is

nonempty and false otherwise; and Q is a top-k query if


Q(D) returns a subset U ⊆ Q(D) such that (a) U consists of at most k tuples (|U| = k if |Q(D)| ≥ k), and (b) for all tuples t ∈ Q(D)\U and s ∈ U, f(s) ≥ f(t) [30].

Example 7. Continuing with Example 6, we would have a tuple (friend, id1 , 5000, T ) for some value T in the access schema
A. That is, there exists an index on id1 such that if id1 is provided, at most 5000 tuples with such an id exist in friend, and it
takes time T to retrieve those. In addition, we would have a tuple (person, id, 1, T ) in A, indicating that id is a key for person
with a known time T for retrieving the tuple for a given id.

Decision problems. To determine whether a query Q is scale


independent, we need to study the following decision problems.
The scale independence problem for (Q, D).
INPUT: A relational schema R, an instance D of R,
a query Q ∈ Q over R, and M ≥ 0.
QUESTION: Is Q scale independent in D w.r.t. M ?

Computing a core by leveraging access schema. Given a relational schema R, we say that a query Q is scale independent
under access schema A if for all instances D of R that conform
to A, the answer Q(D) can be computed in time that depends
only on A and Q, but not on D. That is, Q is scale independent for R in the presence of A, independent of the size of the
underlying D. The following results are known.

The scale independence problem for Q.


INPUT: R, a query Q ∈ Q over R, and M ≥ 0.
QUESTION: Is Q scale independent for R w.r.t. M ?
That is, we want to find minimum cores for answering Q.
The complexity bounds of these problems have been established [36]. The problems are rather intriguing. For instance,
the first one is Σ^p_3-complete (i.e., NP^{NP^NP}) when Q is the class of conjunctive queries, and it is PSPACE-complete when Q is relational algebra (i.e., first-order logic). Worse still, the second
problem becomes undecidable for relational algebra. This is not
surprising in database theory: for instance, the classical membership problem (see Section 2) is NP-complete for conjunctive
queries, and PSPACE-complete for relational algebra [1].

There is a set of syntactic rules for us to determine whether


a relational algebra query Q is scale independent under A;
this provides us with a systematic method and a sufficient
condition to check whether Q can be answered by accessing a bounded number of tuples in all instances of R [36].
For conjunctive queries Q, there exists a characterization,
i.e., a sufficient and necessary condition, to decide whether
Q is scale independent under A; better still, the decision
problem is in polynomial time in the size of Q and A [17].
If Q is scale independent under A, then an efficient query
plan can be worked out using the rules, such that we can
find a core DQ with a bounded size and Q(D) = Q(DQ ).
For conjunctive queries, there has been an experimental
study with real-life data showing that such a query plan takes 9 seconds as opposed to 14 hours taken by the commercial system MySQL [17]! Moreover, it is easy to mine access constraints from real-life data, and a large percentage of queries are scale independent under simple access constraints. In other words, the approach of exploiting scale
independence is effective and practical.

Identifying a core. We have seen that it is rather expensive to


determine whether a query Q is scale independent. Moreover,
even after Q is found scale independent in a dataset D, it is
non-trivial to identify a core DQ for answering Q in D with a
bounded size. As an example, consider a Boolean conjunctive
query Q over a relational schema R. As remarked earlier, we
know that Q is scale independent for R. The question is: is
there an efficient algorithm that, given an instance D of R, finds
a core DQ ⊆ D such that |DQ| ≤ ||Q|| and Q(DQ) = Q(D)?
We approach this following the common practice of database
people: we provide a sufficient condition for checking whether
Q is scale independent and if so, for helping us efficiently compute a core for answering Q. This is formalized as follows.

3.2. Making Queries BD-tractable


We next turn to practical techniques for making big data
small, and hence, BD-tractable. We take graph pattern matching in social graphs as our application domain, and present
four data reduction strategies as examples, namely, distributed
query processing via partial evaluation [47], query-preserving
data compression [45], view-based query answering [50], and
bounded incremental computation [49, 82]. The idea behind
these approaches is simple. When our dataset D is a social
graph G and Q is a pattern query, the complexity of computing
query answer Q(G) (the set of matches of Q in G) is measured
by a function f (|Q|, |G|). Since f (, ) may be the lower bound
of the computation and cannot be further reduced, and |Q| is
typically small in practice, we reduce |G|, i.e., by making big
G small, to reduce the response time of query answering.

Access schema. We define an access schema A over a relational


schema R to be a set of tuples (R, X, N, T ), where
R is a relation schema in R,
X is a set of attributes of R, and
N and T are natural numbers.
We say that a database instance D of R conforms to the access schema A if for each (R, X, N, T) ∈ A:
for each tuple ā of values of the attributes in X, the set σ_{X=ā}(R) has at most N tuples, i.e., there exist at most N tuples t in R such that t[X] = ā; and
σ_{X=ā}(R) can be retrieved from D in time at most T.
That is, there exists an index on X that allows efficient retrieval
of certain tuples from D, and there is a bound on the number of
such tuples. Access schemas are a combination of indices and
database dependencies, which are commonly used in practice.
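Operationally, conformance to an access schema can be pictured as follows: each entry (R, X, N, T) is backed by an index on X whose buckets never exceed N tuples, so a lookup fetches a bounded amount of data. The class below is a small illustrative sketch; its names are assumptions.

```python
from collections import defaultdict

class AccessIndex:
    """Backs one access-schema entry (R, X, N, T): an index on attributes X of R
    that returns at most N tuples per lookup."""
    def __init__(self, relation, x_attrs, n_bound):
        self.x_attrs, self.n_bound = tuple(x_attrs), n_bound
        self.buckets = defaultdict(list)
        for t in relation:
            self.buckets[tuple(t[a] for a in self.x_attrs)].append(t)

    def conforms(self):
        """Does the instance satisfy the cardinality bound N on every X-value?"""
        return all(len(b) <= self.n_bound for b in self.buckets.values())

    def fetch(self, values):
        """Retrieve sigma_{X = values}(R); under conformance, at most N tuples."""
        return self.buckets.get(tuple(values), [])

# Example 7 revisited: (friend, id1, 5000, T) and (person, id, 1, T).
# friend_idx = AccessIndex(friend_rel, ["id1"], 5000)
# person_idx = AccessIndex(person_rel, ["id"], 1)
```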

Graph pattern matching. We start with a review of graph pattern matching in social graphs, which typically represent social
networks, e.g., Facebook, Twitter, LinkedIn.

evaluation (see [68] for a survey). Given a function f (s, d) and


part of its input s, partial evaluation is to specialize f (s, d) with
respect to the known input s. That is, it conducts as much as
possible the part of f(s, ·)'s computation that depends only on s, and generates a partial answer, i.e., a residual function f′(·)
that depends on the as yet unavailable input d.
This idea can be naturally applied to distributed graph pattern matching. Consider a graph pattern Q posed on a graph G
that is partitioned into fragments F = (F1 , . . . , Fn ), where Fi is
stored in site Si . We compute Q(G) as follows.
(1) The same pattern Q is posted to each fragment in F .
(2) Upon receiving pattern Q, each site Si computes a partial
answer Q(Fi ) of Q in fragment Fi , in parallel, by taking
Fi as the known input s while treating the fragments that
reside in the other sites as yet unavailable input d.
(3) A coordinator site Sc collects partial answers from all the
sites. It then assembles the partial answers and finds the
answer Q(G) to Q in the entire graph G.
The idea behind this is simple: we divide a big G into a collection F = (F1 , . . . , Fn ) of fragments, such that the response
time is determined by the cost of computing Q(Fm ) (step 2),
where Fm is the largest fragment in F , and the cost of assembling partial answers (step 3). In other words, its parallel computational cost is dominated by the largest fragment Fm , rather
than the original big graph G. In this way, we reduce a big G to
small fragments Fi , and hence, reduce the response time. When
G is not already partitioned and distributed, one may first partition G as preprocessing. In particular, when we can afford a
number of processors, each Fi may have a manageable size and
hence, the computation of Q(Fi ) is feasible at each site.
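The coordination skeleton of steps (1)-(3) can be sketched as follows. For brevity, the "sites" are simulated by plain function calls, and the example query is one whose answer distributes over fragments (a node-selection query); the partial-evaluation algorithms of [47] additionally exchange Boolean variables for edges that cross fragments, which is omitted here.

```python
def partial_eval(fragment, condition):
    """Step (2): evaluate the query on one fragment Fi in isolation, treating the
    other fragments as the 'yet unavailable' input; here the partial answer is the
    set of local nodes satisfying the pattern node's search condition."""
    return {v for v, attrs in fragment.items() if condition(attrs)}

def distributed_match(fragments, condition):
    """Steps (1) and (3): the coordinator ships the same query to every site and then
    assembles the partial answers (each call below would run at a different site)."""
    partial_answers = [partial_eval(f, condition) for f in fragments]
    return set().union(*partial_answers)

# Usage: fragments = [{"v1": {"job": "DBA"}}, {"v7": {"job": "DBA"}, "v9": {"job": "HR"}}]
#        distributed_match(fragments, lambda a: a.get("job") == "DBA")
```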
There are many ways to develop distributed algorithms for
graph pattern matching. To evaluate and assess these algorithms, we propose the following criteria. We say that a distributed algorithm T is scalable parallel if for all patterns Q,
all graphs G and all fragmentations F of G,

Social graphs. A social graph is a node-labeled directed graph


G = (V, E, fA), where (a) V is a finite set of nodes; (b) E ⊆ V × V, in which (v, v′) denotes an edge from node v to node v′; and (c) fA(·) is a function that associates each node v in V with a tuple fA(v) = (A1 = a1, . . . , An = an), where ai is a constant, and Ai is referred to as an attribute of v, written as v.Ai. In social
graphs, each node denotes a person, and its attributes carry the
contents of the node, e.g., label, keywords, blogs, rating. An
edge represents a relationship between two people.
Patterns. A graph pattern is given as Q = (VQ , EQ , fv ), where
VQ is a finite set of nodes and EQ is a set of directed edges,
as defined for social graphs; and
fv(·) is a function defined on VQ such that for each node u, fv(u) is the search condition for u, defined as a conjunction of atomic formulas of the form A op a; here A denotes an attribute, a is a constant, and op is one of the comparison operators <, ≤, =, ≠, >, ≥.
We say that a node v in a social graph G satisfies the search condition of a pattern node u in Q if for each atomic formula A op a in fv(u), there exists an attribute A defined by fA(v) such that v.A op a.
Graph pattern matching. Given a social graph G and a graph
pattern Q, we want to compute the set Q(G) of all matches in G
for Q. In this section we consider a simple semantics for graph
pattern matching, based on graph simulation [77], which has
been widely used in Web site classification and social position
detection, among other things (e.g., [15, 19, 79, 97]).
We say that a social graph G matches a graph pattern Q via graph simulation, denoted by Q ⊴sim G, if there exists a binary relation S ⊆ VQ × V that is inductively defined as follows:
for each pattern node u ∈ VQ, there exists a node v ∈ V in the social graph such that (u, v) ∈ S; and
for each (u, v) ∈ S, (a) v satisfies the search condition of u, and (b) for each edge (u, u′) in EQ, there is an edge (v, v′) in E such that (u′, v′) ∈ S.
We refer to S as a match in G for Q.
It is known that if Q ⊴sim G, then there exists a unique maximum match So [62], i.e., for any match S in G for Q, S ⊆ So. We define Q(G) = So if Q ⊴sim G, and Q(G) = ∅ otherwise.
It is known that it takes O(|Q|^2 + |Q||G| + |G|^2) time to compute So [62], where |G| denotes the size of G measured in the number of nodes and edges; similarly for the size |Q| of Q. As
remarked earlier, real-life social graphs are typically big, e.g.,
Facebook graph has more than 1 billion nodes and 140 billion
links [28]. Hence it is often prohibitively expensive to compute
Q(G) for social graphs G in the real world. These highlight
the need for developing efficient techniques for graph pattern
matching to cope with the sheer size of G.
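For concreteness, here is a straightforward fixpoint computation of the maximum match So (a naive sketch, not the optimized algorithm of [62]); graphs and patterns are plain dictionaries, and satisfies(v, u) is assumed to check the search condition fv(u) against fA(v).

```python
def max_simulation(pattern_nodes, pattern_edges, graph_nodes, graph_edges, satisfies):
    """Compute the maximum match S of a pattern in a graph via graph simulation,
    as a dict u -> set of graph nodes; returns {} if the pattern has no match.
    pattern_edges / graph_edges map a node to the set of its successors."""
    # Start from all candidate pairs that satisfy the search conditions.
    sim = {u: {v for v in graph_nodes if satisfies(v, u)} for u in pattern_nodes}
    changed = True
    while changed:
        changed = False
        for u in pattern_nodes:
            for u_succ in pattern_edges.get(u, set()):
                # v can simulate u only if some successor of v simulates u_succ.
                keep = {v for v in sim[u] if graph_edges.get(v, set()) & sim[u_succ]}
                if keep != sim[u]:
                    sim[u] = keep
                    changed = True
    return sim if all(sim[u] for u in pattern_nodes) else {}
```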

if its parallel computation cost is bounded by a polynomial


in |Q|, |Fm | and |Vf |, and
the total data shipped is bounded by a polynomial in |Q|
and |Vf |,
where Vf is the set of nodes with edges across different fragments in F . That is, the response time of T is dominated by
the size of the query, the largest fragment in F , and how F
partitions G, rather than by the size of the underlying G; similarly for its network traffic. In practice |Vf | is typically much
smaller than |G|, and |Q| is also small. Hence, if algorithm T
has this property, then the more processors are available, the
smaller the fragments tend to be, and therefore, the less parallel
computation time and network traffic are needed.
Note that MapReduce algorithms require us to re-distribute
the data in each round of Map and Reduce; hence, they are not
scalable parallel. In contrast, there exist scalable parallel algorithms for distributed graph simulation based on partial evaluation. Part of the results has been reported in [47] for patterns
defined in terms of regular expressions. It is shown that there
exists a distributed algorithm to answer such pattern queries

Distributed query processing with partial evaluation. Distributed query processing is perhaps the most popular approach
to querying big data, notably MapReduce [23]. Here we advocate distributed query processing with partial evaluation.
Partial evaluation has been used in a variety of applications
including compiler generation, code optimization and dataflow

by visiting each site once,

Q′ are equivalent, i.e., for all datasets D, Q and Q′ produce the same answers in D, and moreover, (b) Q′ refers only to V and its extensions V(D), without accessing the underlying D.
View-based query answering suggests another approach to making big data small. As an example, consider graph pattern queries for social network analysis. Given a big graph G,
one may identify a set V of views (pattern queries) and materialize them with V(G) of matches for patterns of V in G, as a
preprocessing step offline. Then matches for patterns Q can be
computed online by using V(G) only. In practice, V(G) is typically much smaller than G, and hence, this approach allows us
to query big G by accessing small V(G). Better still, the views
can be incrementally maintained in response to changes to G,
and adaptively adjusted to cover various patterns. In light of
this, this approach has generated renewed interest for querying
big graphs as well as other forms of big data [8, 36, 50].
More specifically, for pattern queries based on graph simulation in social network analysis, we know the following [50].
Given a graph pattern Q and a set V of view definitions,
it is in O(|Q|^2 |V|) time to decide whether query Q can be
answered by using views V; and if so,

in O(|Fm||Q|^2 + |Q|^2|Vf|^2) time, and
with O(|Q|^2|Vf|^2) communication cost.
That is, it has performance guarantees on both response time
and communication cost, as well as on site visits.
Query preserving graph compression. Another approach to
reducing the size of big graph G is by means of compressing
G, relative to a class Q of queries of users' choice, e.g., graph pattern queries. More specifically, a query preserving graph compression for Q is a pair ⟨R, P⟩, where R(·) is a compression function, and P(·) is a post-processing function. For any graph G, Gc = R(G) is the compressed graph computed from G by R(·), such that (1) |Gc| ≤ |G|, and (2) for all queries Q ∈ Q, Q(G) = P(Q(Gc)). Here P(Q(Gc)) is the result of post-processing the answers Q(Gc) to Q in Gc.
That is, we preprocess G by computing the compressed Gc of G offline. After this step, for any query Q ∈ Q, the answers Q(G) to Q in the big G can be computed by evaluating the same Q on the smaller Gc online. Moreover, Q(Gc) can be computed without decompressing Gc. Note that the compression scheme is lossy: we do not need to restore the original
G from Gc . That is, Gc only needs to retain the information
necessary for answering queries in Q, and hence can achieve a
better compression ratio than lossless compression schemes.
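To make the ⟨R, P⟩ interface concrete, below is a small sketch for the special case of reachability queries, using SCC condensation as the compression R (this is not the scheme of [45], merely a simple instance of the same idea): a node u reaches v in G if and only if the supernode of u reaches the supernode of v in Gc, so the post-processing only needs the node-to-supernode map.

```python
from collections import defaultdict

def compress(nodes, edges):
    """R(): condense strongly connected components (iterative Kosaraju).
    Returns the node -> supernode map and the edge set of the compressed graph Gc."""
    adj, radj = defaultdict(list), defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        radj[v].append(u)
    order, seen = [], set()
    for s in nodes:                              # pass 1: post-order on G
        if s in seen:
            continue
        seen.add(s)
        stack = [(s, iter(adj[s]))]
        while stack:
            u, it = stack[-1]
            w = next(it, None)
            if w is None:
                order.append(u)
                stack.pop()
            elif w not in seen:
                seen.add(w)
                stack.append((w, iter(adj[w])))
    scc = {}
    for s in reversed(order):                    # pass 2: components on the reversed graph
        if s in scc:
            continue
        todo = [s]
        while todo:
            u = todo.pop()
            if u not in scc:
                scc[u] = s
                todo.extend(radj[u])
    gc_edges = {(scc[u], scc[v]) for u, v in edges if scc[u] != scc[v]}
    return scc, gc_edges

def reachable(u, v, scc, gc_edges):
    """P()-style post-processing: answer 'does u reach v?' on the small Gc only."""
    if scc[u] == scc[v]:
        return True
    succ = defaultdict(list)
    for x, y in gc_edges:
        succ[x].append(y)
    frontier, seen = [scc[u]], {scc[u]}
    while frontier:
        x = frontier.pop()
        if x == scc[v]:
            return True
        for y in succ[x]:
            if y not in seen:
                seen.add(y)
                frontier.append(y)
    return False
```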

Q(G) can be computed in O(|Q||V(G)| + |V(G)|^2) time;


better still, |V(G)| is about 4% of |G| (i.e., |V |+|E|) on
average for real-life social graphs; and as a result of these,
the view-based approach takes no more than 6% of the
time needed for computing Q(G) directly in G on average.

For a query class Q, if Gc can be computed in PTIME and


moreover, queries in Q can be answered using Gc in parallel
polylog-time, perhaps by combining with other techniques such
as indexing and distributed processing, then Q is BD-tractable.

Contrast these with the O(|Q|^2 + |Q||G| + |G|^2) complexity of


graph simulation! Note that |Q| and |V| are sizes of pattern
queries and are typically much smaller than G in real life.
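A minimal sketch of the view-based idea, for the simple case where each view records the set of graph nodes matching one node predicate and a query is a conjunction of such predicates: the online evaluation touches only the materialized extents V(G), never G itself. (The actual algorithms for simulation-based patterns in [50] are considerably more involved.)

```python
def materialize_views(graph_nodes, view_predicates):
    """Offline: compute V(G), one node set per view predicate."""
    return {name: {v for v, attrs in graph_nodes.items() if pred(attrs)}
            for name, pred in view_predicates.items()}

def answer_with_views(materialized, query_view_names):
    """Online: a query that is a conjunction of view predicates is answered by
    intersecting their materialized extents; G is never accessed."""
    extents = [materialized[name] for name in query_view_names]
    return set.intersection(*extents) if extents else set()

# Usage sketch:
# vg = materialize_views(graph_nodes, {"dba": lambda a: a["job"] == "DBA",
#                                      "nyc": lambda a: a["city"] == "NYC"})
# answer_with_views(vg, ["dba", "nyc"])
```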

The effectiveness of this approach has been verified [45],


for graph pattern matching based on graph simulation, and for
reachability queries as a special case (i.e., whether there exists
a path from one node to another via social links). More specifically, the following has been reported in [45].

Incremental graph pattern matching. Given a pattern Q and


a graph G, as preprocessing we compute Q(G) once. When G is updated by ΔG, instead of recomputing Q(G ⊕ ΔG) starting from scratch, we incrementally compute ΔM such that Q(G ⊕ ΔG) = Q(G) ⊕ ΔM, to minimize unnecessary recomputation. In real life, ΔG is typically small: only 5% to 10% of nodes are updated weekly [80]. When ΔG is small, ΔM is often small as well, and is much less costly to compute than Q(G ⊕ ΔG). The idea has also been adopted for querying big data [8, 36, 49].
The benefit is more evident if there exists a bounded incremental matching algorithm. As argued in [82], incremental algorithms should be analyzed in terms of |CHANGED| = |ΔG| + |ΔM|, the size of the changes in the input and output, which represents the updating costs that are inherent to the incremental problem itself. An incremental algorithm is said to be semi-bounded if its cost can be expressed as a polynomial of |CHANGED| and |Q| [49]. That is, its cost depends only on the size of the changes and the size of pattern Q, independent of the size of big G. This effectively makes big G small, since |CHANGED| ≪ |G|, and Q is typically small in practice.
For graph pattern matching via graph simulation, it has been shown that there exists a semi-bounded incremental algorithm in O(|ΔG|(|Q||CHANGED| + |CHANGED|^2)) time [49].
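As a toy illustration of the principle Q(G ⊕ ΔG) = Q(G) ⊕ ΔM, consider a query that simply selects the nodes satisfying a predicate; ΔM can then be computed from the changed nodes alone, in time proportional to |ΔG| rather than |G|. (The semi-bounded incremental simulation algorithm of [49] is, of course, far more sophisticated.)

```python
def eval_query(graph_attrs, pred):
    """One-off evaluation of Q(G): the nodes whose attributes satisfy pred."""
    return {v for v, attrs in graph_attrs.items() if pred(attrs)}

def incremental_update(old_answer, delta_attrs, pred):
    """Compute ΔM = (inserted, deleted) from the updated nodes ΔG only, so the
    cost depends on |ΔG| + |ΔM| rather than on |G|.  delta_attrs maps each
    changed node to its new attributes, or to None if the node was deleted."""
    inserted, deleted = set(), set()
    for v, attrs in delta_attrs.items():
        matches_now = attrs is not None and pred(attrs)
        if matches_now and v not in old_answer:
            inserted.add(v)
        elif not matches_now and v in old_answer:
            deleted.add(v)
    return inserted, deleted

# Q(G ⊕ ΔG) = (Q(G) ∪ inserted) − deleted, applied without recomputation from scratch.
```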
In general, a query class Q can be considered BD-tractable if
(a) preprocessing Q(D) is in PTIME, and (b) Q(D ⊕ ΔD) can

There exists a query preserving compression ⟨R, P⟩ for graph pattern matching with simulation, such that for any graph G = (V, E, fA), R(·) is in O(|E| log |V|) time, and P(·) is in linear time in the size of the query answer.
This compression scheme reduces the sizes of real-life social graphs by 98% and 57%, and query evaluation time by 94% and 70% on average, for reachability queries and pattern queries with graph simulation, respectively.
Better still, the compressed Gc can be efficiently maintained. Given a graph G, a compressed graph Gc = R(G) of G, and updates ΔG to G, we can compute changes ΔGc to Gc such that Gc ⊕ ΔGc = R(G ⊕ ΔG), without decompressing Gc [45]. As a result, for each graph G, we need to compute its compressed graph Gc once for all patterns. When G is updated, Gc is incrementally maintained.
Graph pattern matching using views. This technique is commonly used (see [73, 59] for surveys). Given a query Q ∈ Q and a set V of view definitions, query answering using views is to reformulate Q into another query Q′ such that (a) Q and

be incrementally computed in parallel polylog-time. If so, it is


feasible to answer Q in response to changes to big data D.

widely used in industry.


(5) As we have seen, view-based query answering provides us
with an effective technique for querying big data. To make practical use of it, however, we need to answer the following question. Given a query workload, what views should we select to
build and maintain, such that the queries can be efficiently answered by using views or better still, be scale independent?

Remarks and open issues. We remark the following.


(1) There are a number of other effective techniques for querying big data, notably indexing, which we have seen earlier. These techniques and the strategies outlined above can be, and should be, combined when querying big data.
(2) View-based and incremental techniques can help us make queries scale independent [36]. More specifically, when a query Q is not scale independent, we may still make it feasible to query big data incrementally, i.e., to evaluate Q incrementally in response to changes ΔD to D, by accessing an M-fraction of the dataset D. That is, we compute Q(D) once and offline, and then incrementally answer Q on demand. We may also achieve scale independence using views, i.e., when a set V of views is defined, we rewrite Q into Q′ using V, such that for any dataset D, we can compute Q(D) by using Q′, which accesses materialized views V(D) and fetches only a bounded amount of data from D. We refer the interested reader to [36] for details.

4. Approximate Query Answering


The strategies we have seen in Section 3 help us make it
feasible to answer some queries in big data. However, some
queries may not be made BD-tractable. An example is graph
pattern matching defined with subgraph isomorphism: it is NP-complete even to decide whether there exists a match (cf. [81]).
For such queries, it is beyond reach to find exact answers in big
data. Moreover, as remarked earlier, even for queries that can
be answered in PTIME, it is sometimes too costly to compute
their exact answers in big data. In light of this, we often have to
evaluate these queries by using inexact algorithms, preferably
approximation algorithms with performance guarantees.

We conclude the section with several open issues.

This section proposes two approaches to developing approximation algorithms for answering queries in big data, referred
to as query-driven and data-driven approximation.

(1) As we have seen in Section 3.1, access schemas help us determine whether a query is scale independent and if so, develop
an efficient plan to evaluate the query. A practical question asks
how to design an optimal access schema for a given query
workload, such that we can answer as many given queries as
possible by accessing a bounded amount of data.

4.1. Query Driven Approximation


For some query classes Q, we can relax their semantics, such that it is less costly to answer queries Q of Q in a big dataset D under the new semantics and, moreover, the answer Q(D) still gives users what they want. To illustrate this, we give two examples: graph pattern matching and top-k query answering.

(2) As remarked earlier, Boolean conjunctive queries are scale


independent even in the absence of access schema. A natural
question is: given a Boolean conjunctive query Q and a dataset
D on which Q is defined, how can we efficiently identify a core
of D for answering Q, in the absence of access schema?

Graph pattern matching revisited. We first review graph pattern matching defined in terms of subgraph isomorphism. Consider a social graph G = (V, E, fA) and a graph pattern Q = (VQ, EQ, fv) as defined in Section 3.2. Consider a subgraph G′ = (V′, E′, f′A) of G, where V′ is a subset of V, and E′ and f′A are the restrictions of E and fA to V′, respectively.

(3) The third question concerns distributed pattern matching.


Does there exist a distributed algorithm at all that, given a
pattern query Q and a graph G that is partitioned into F =
(F1 , . . . , Fn ), computes the matches Q(G) of Q in G, such that
its response time and data shipment depend on the size of Q and
the largest fragment Fm of F only? This question asks about
the possibility or impossibility of distributed query processing
with certain performance guarantees. Recent work has shown
that this is beyond reach for distributed graph simulation (although distributed simulation has certain performance guarantees) [51]. However, the question remains open for distributed
pattern matching by, e.g., subgraph isomorphism.

We say that G′ matches Q by isomorphism, denoted as Q ≼iso G′, if there is a bijective function h(·): VQ → V′ such that (a) u ∼ h(u) for each node u ∈ VQ, and (b) for each pair (u, u′) of nodes in VQ, (u, u′) ∈ EQ if and only if (h(u), h(u′)) ∈ E′.
Graph pattern matching by subgraph isomorphism is to compute, given a social graph G and a graph pattern Q, the set Q(G) of all subgraphs G′ of G such that Q ≼iso G′. This semantics has been proposed for social graph analysis. However, it is intractable even in the classical computational complexity theory to compute Q(G) based on subgraph isomorphism.

(4) A more general question asks about parallel scalability: for


a query class, does there exist an algorithm for answering its
queries such that the more processors are used, the less time it
takes? That is, if we could afford unlimited resources, then
a parallel scalable algorithm makes it feasible to answer the
queries on big data, by using more computing facilities. There
has been work on this issue. Unfortunately, the prior work focuses on either shared-memory architectures [72] or MapReduce [69, 89]. A standard notion of parallel scalability is not
yet in place for general shared-nothing architectures, which are

In light of the high complexity, we adopt graph simulation for graph pattern matching instead of subgraph isomorphism [42]. That is, we check Q ≼sim G′ (Section 3) rather than Q ≼iso G′ for subgraphs G′ of G. In fact, several revisions of graph simulation have been proposed, by allowing pattern edges to map

90]. An NPO A has a set I of instances, and for each instance x ∈ I and each feasible solution y of x, there exists a positive score m(x, y) indicating the quality measure of y. Consider a function η(·) from natural numbers to the range (0, 1].
An algorithm T is called an η-approximation algorithm for problem A if for each instance x ∈ I, T computes a feasible solution y of x such that R(x, y) ≥ η(|x|), where R(x, y) is the performance ratio of y w.r.t. x, defined as follows [21]:

    R(x, y) = opt(x) / m(x, y)   if A is a minimization problem,
    R(x, y) = m(x, y) / opt(x)   if A is a maximization problem,
to paths [42], incorporating edge labels [41], and retaining the


topology of graph patterns [74]. These reduce the complexity
of graph pattern matching from intractability (subgraph isomorphism) to low polynomial time (quadratic time or cubic time).
Better still, it has been shown using real-life social networks
that graph pattern matching with (revisions of) graph simulation is able to capture more sensible matches in social graph
analysis than subgraph isomorphism can find. In other words,
by relaxing the semantics of graph pattern matching from subgraph isomorphism to (revised) graph simulation, we can find
high-quality matches for social data analysis in much less time.
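To make the relaxed semantics concrete, the following sketch computes the maximum graph simulation relation by a naive fixpoint (a straightforward, unoptimized version, not the algorithms of [42]); label_ok(u, v) is an assumed predicate checking that graph node v satisfies the search condition of pattern node u. The pattern matches the graph exactly when every pattern node retains at least one candidate.

    def maximum_simulation(vq, eq, vg, eg, label_ok):
        """Naive fixpoint for the maximum simulation of pattern (vq, eq) in graph (vg, eg)."""
        succ_q = {u: [y for (x, y) in eq if x == u] for u in vq}
        succ_g = {v: {y for (x, y) in eg if x == v} for v in vg}
        sim = {u: {v for v in vg if label_ok(u, v)} for u in vq}
        changed = True
        while changed:
            changed = False
            for u in vq:
                for v in list(sim[u]):
                    # v can simulate u only if every pattern edge (u, u2) is matched
                    # by some edge from v into the current candidate set sim[u2]
                    if any(not (succ_g[v] & sim[u2]) for u2 in succ_q[u]):
                        sim[u].discard(v)
                        changed = True
        return sim    # Q matches G iff sim[u] is nonempty for every pattern node u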
Top-k graph pattern matching. As remarked earlier, even
quadratic-time or cubic-time complexity may be too high when
querying big data. In light of this, we may further relax the semantics of graph pattern matching defined with (revised) graph
simulation and hence reduce the cost of the computation.

where opt(x) is the value of an optimal solution of x. That is, while the solution y found by algorithm T(x) may not be optimal, it is not too far from opt(x) (i.e., R(x, y) ≥ η(|x|)).
However, such PTIME approximation algorithms directly operate on the original instances of a problem, and may not work well when querying big data, for the following reasons.

In social data analysis we often want to find matches of a particular pattern node uo in Q as the query focus [11]. That is, we just want those nodes in a social graph G that are matches of uo in Q(G), rather than the entire set Q(G) of matches for Q. Indeed, a recent survey shows that 15% of social queries are to find matches of specific pattern nodes [78]. Moreover, it often suffices to find top-k matches of uo in Q(G). More specifically, assume a scoring function s(·) that, given a match v of uo, returns a non-negative real number s(v). For a positive integer k, top-k graph pattern matching is to find a set U of matches of uo in Q(G), such that U has exactly k matches and, moreover, for any k-element set U′ of matches of uo, s(U′) ≤ s(U), where s(U) is defined as Σv∈U s(v). When there exist fewer than k matches of uo in Q(G), U includes all the matches (see, e.g., [30] for top-k query answering).

(1) As we have seen in Section 2, PTIME algorithms on x may


be beyond reach in practice when x is big. Moreover, approximation algorithms are needed for problems that are traditionally
considered tractable [58], not limited to NPO.
(2) In contrast to NPOs that ask for a single optimum, answering a query Q in a dataset D is to find a set Q(D) of query
answers. Thus we need to revise the notion of performance ratios to assess the quality of a set of feasible answers.
Resource-bounded approximation. To cope with this, below we propose resource-bounded approximation. In a nutshell, given a small ratio α ∈ (0, 1) and a query Q posed on a dataset D, we extract a fraction DQ of D such that |DQ| ≤ α|D|, and compute approximate answers Q(DQ). Here α is called a resource ratio or a resolution. It is determined by our available resources for query evaluation, such as time and space.
Intuitively, the idea is the same as how we process our photos. When we cannot afford the time or storage for photos of high resolution, we settle for smaller images with lower resolution to reduce the cost, as long as such images are not too rough.
To formalize the idea, we first revise the notion of performance ratios for query answering. We then define resource-bounded approximation and demonstrate its effectiveness.

This suggests that we develop algorithms to find top-k matches with the early termination property [30], i.e., they stop as soon as a set of top-k matches is found, without computing the entire Q(G). While the worst-case time complexity of such algorithms may be no better than their counterparts for computing the entire Q(G), they may only need to inspect part of big G, without paying the price of full-fledged graph pattern matching. Indeed, for graph pattern matching defined in terms of graph simulation, we find that top-k matching algorithms inspect only 65%–70% of the matches in Q(G) on average in real-life social graphs [48], even when diversity is taken into account to remedy the over-specification problem of retrieving too homogeneous answers [56], which makes top-k query answering a much harder bi-criteria optimization problem [24].
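A minimal sketch of such an early-termination procedure is given below, under the (hypothetical) assumption that candidate matches of uo can be enumerated in non-increasing order of an upper bound on their scores, and that verify(v) returns the actual score s(v) of a match or None otherwise; it stops as soon as no remaining candidate can enter the top-k set.

    import heapq

    def topk_matches(candidates, verify, k):
        """candidates yields (upper_bound, v) with non-increasing bounds; v is a node id."""
        top = []                                    # min-heap of (score, v), size at most k
        for bound, v in candidates:
            if len(top) == k and bound <= top[0][0]:
                break                               # early termination: the top-k set is final
            s = verify(v)                           # inspect this candidate only
            if s is not None:
                heapq.heappush(top, (s, v))
                if len(top) > k:
                    heapq.heappop(top)
        return sorted(top, reverse=True)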

Accuracy of query answers. Consider a query Q and a dataset


D. The exact answers to Q in D are typically a set Q(D).
Suppose that an algorithm T computes a set Y of approximate
answers to Q in D. We define the precision and recall of the set
Y for (Q, D) in the standard way, as follows:

4.2. Data Driven Approximation


In some applications we may not be able to relax the semantics of our queries. To this end, we propose a data-driven
approximation strategy, referred to as resource-bounded approximation. Below we first review traditional approximation
schemes, and then introduce resource-bounded approximation.

    precision(Q, D, Y) = |Y ∩ Q(D)| / |Y|,
    recall(Q, D, Y) = |Y ∩ Q(D)| / |Q(D)|.
That is, precision is the ratio of the number of correct answers in


Y to the total number of answers in Y , while recall is the ratio of
the number of correct answers in Y to the total number of exact

Traditional approximation algorithms. Previous work on this


subject has mostly focused on developing PTIME approximation algorithms for NP-optimization problems (NPOs) [21, 58,

answers in Q(D). Based on these, we define the accuracy of Y


for (Q, D) by adopting the usual F-measure [93]:

    accuracy(Q, D, Y) = 2 · precision(Q, D, Y) · recall(Q, D, Y) / (precision(Q, D, Y) + recall(Q, D, Y)).

are supported by Graph Search of Facebook, e.g., find me all


my friends in Beijing who like cycling [29].
A personalized search is specified by a graph pattern Q in
which a node up is designated to map to a particular node (person) vp in a social graph G. As in the case for top-k graph pattern matching described earlier, the pattern Q also has a particular output pattern node uo . The search is to compute Q(G),
the set of all matches of the output pattern node uo of Q in
graph G, while the personalized node up is mapped to vp
in G. Such searches are similar to what we have seen in Example 6. In contrast to queries given there, here we consider
queries Q that are graph patterns rather than relational queries,
and moreover, may not be scale independent in G.

That is, accuracy(Q, D, Y) is the harmonic mean of precision and recall. Obviously, the larger accuracy(Q, D, Y) is, the more accurate Y is.
When both Q(D) and Y are ∅, i.e., no answer exists, we treat accuracy(Q, D, Y) as 1; we consider precision only if Q(D) is ∅ but Y is not, and recall only if Y is ∅ but Q(D) is not.
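With the conventions above, the measure can be computed directly; the small helper below is written solely for this discussion.

    def accuracy(exact, approx):
        """F-measure of approximate answers against the exact answers Q(D)."""
        exact, approx = set(exact), set(approx)
        if not exact and not approx:
            return 1.0                              # no answer exists and none is returned
        correct = len(exact & approx)
        if not exact:                               # Q(D) empty: only precision applies
            return correct / len(approx)
        if not approx:                              # Y empty: only recall applies
            return correct / len(exact)
        p, r = correct / len(approx), correct / len(exact)
        return 0.0 if p + r == 0 else 2 * p * r / (p + r)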
Resource-bounded query answering. We now present resource-bounded approximation algorithms. Let α ∈ (0, 1) be a resource ratio (or resolution), and Q be a class of queries.

For such patterns, we have developed resource-bounded approximation algorithms for graph pattern matching defined in terms of subgraph isomorphism and graph simulation (see Section 3.2). We have experimented with these algorithms using real-life social graphs. The results are very encouraging. We find that our algorithms are efficient: they are 135 and 240 times faster than traditional pattern matching algorithms based on graph simulation and subgraph isomorphism, respectively. Better still, the algorithms are accurate: even when the resource ratio α is as small as 15 × 10⁻⁶, the algorithms return matches with 100% accuracy! Observe that when G consists of 1PB of data, α|G| is down to 15GB, i.e., resource-bounded approximation truly makes big data small, without paying too high a price in the accuracy of query answers.

Given a dataset D and a query Q in Q, an algorithm T for Q with resource bound α does the following: it visits a fraction DQ of D such that |DQ| ≤ α|D|, and computes Q(DQ) as approximate answers.
We say that T has accuracy ratio η for Q if for all datasets D and all queries Q in Q, accuracy(Q, D, Q(DQ)) ≥ η. Note that the accuracy ratio η is in the range (0, 1]. When η = 1, algorithm T finds exact answers for all datasets D and queries Q, i.e., the algorithm has 100% accuracy.
Algorithm T consists of two steps: it first reduces big D to a
small DQ , and then computes approximate query answers, both
by accessing a bounded amount of data. Observe the following.

A similar idea has also been verified effective by BlinkDB


[4]. BlinkDB adaptively samples data to find approximate answers to relational queries within a probabilistic error-bound
and time constraints. In other words, it answers queries using
data samples DQ of a dataset D, instead of D.

(1) Dynamic reduction. Recall that traditional data reduction


schemes such as compression, summarization and data synopses, build the same structure for all queries [2, 9, 27, 53, 54,
65, 66, 70, 84, 91]. This is also what the strategies of Section 3.2 do. We refer to such strategies as uniform reduction.

Open issues. There is naturally more to be done.

In contrast, resource-bounded approximation adopts a dynamic reduction strategy, which finds a small dataset DQ with only the information needed for an input query Q, and hence allows higher accuracy within the bound α|D| on the data accessed. One can use any techniques for dynamic reduction, including those for data synopses such as sampling and sketching, as long as the process visits a bounded amount of data in D.

(1) For a class Q of queries, the first problem is to find, given a resource ratio α, the maximum provable accuracy ratio η that resource-bounded algorithms can guarantee for Q. A dual problem is to find, given an accuracy guarantee η, the minimum resource ratio α that resource-bounded algorithms can take.
(2) Another problem is to study, given an access schema A,
how we can develop a resource-bounded algorithm that makes
maximum use of A to retrieve data efficiently, i.e., it visits a
minimum amount of data that is not covered by A.

(2) Approximate query answering. Algorithm T computes


Q(DQ) by accessing α|D| amount of data rather than the entire D. It aims to achieve the best performance ratio within α|D|.

(3) The third topic is to develop resource-bounded approximation algorithms in various application domains. For instance,
for social searches that are not personalized, i.e., when no nodes
in a graph pattern are designated to map to fixed nodes in a social graph G, can we develop effective resource-bounded approximation algorithms for graph pattern matching?

(3) Scale independence. When Q is scale independent in


D w.r.t. some M ≤ |D|, resource-bounded approximation achieves 100% accuracy, i.e., with accuracy ratio η = 1.
(4) Access schema. The notion of resource-bounded approximation can be readily defined under an access schema A (see
Section 3.1), to efficiently retrieve a bounded amount of data
for query processing by leveraging indices and bounds in A.
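As an illustration of the two steps, the toy sketch below (not the algorithms of [52]) performs dynamic reduction by a budgeted traversal around the designated node, visiting at most α|G| nodes, and then evaluates the pattern on the extracted subgraph only; match_on is an assumed matching routine, e.g., graph simulation restricted to the extracted area.

    from collections import deque

    def resource_bounded_match(adj, vp, alpha, match_on):
        """Extract at most alpha * |G| nodes around vp, then match on that subgraph."""
        budget = max(1, int(alpha * len(adj)))
        seen, queue, edges = {vp}, deque([vp]), []
        while queue:
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v not in seen and len(seen) < budget:
                    seen.add(v)
                    queue.append(v)
                if v in seen:
                    edges.append((u, v))            # keep only edges inside the extracted area
        return match_on(seen, edges)                # approximate answers Q(D_Q)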

(4) Finally, approximation classes for resource-bounded approximation need to be defined, along the same lines as
their counterparts for traditional approximation algorithms
(e.g., APX, PTAS, FPTAS [21]). Similarly, approximation-preserving reductions should be developed, and complete prob-

Personalized social search. To verify the effectiveness of the


approach, we have conducted a preliminary study of personalized social search in real-life social graphs [52]. Such searches

lems need to be identified for these classes.

then t[AC] must be 10. As a data quality rule, this CFD catches the inconsistency in tuple t1: t1[AC] and t1[city] violate the CFD.

5. Data Quality: The Other Side of Big Data

Data accuracy refers to the closeness of values in a database


to the true values of the entities that the database values represent. Observe that data may be consistent but not accurate. For
instance, one may have a rule for data consistency: age ≤ 120, indicating that a person's age does not exceed 120. Consider a tuple t representing a high school student, with t[age] = 40. While t is not inconsistent, it may not be accurate: a high school student is typically no older than 19 years old.

We have so far focused only on how to cope with the volume


(quantity) of big data. Nonetheless, as remarked earlier, big
data = quantity + quality. This section addresses data quality
issues. We report the state of the art of this line of research,
and identify challenges introduced by big data. The primary
purpose of this section is to advocate the study of the quality
of big data, which has been overlooked by and large, although
data quality and data quantity are equally important.

There has been recent work on data accuracy [16]: given tuples t1 and t2 pertaining to the same entity e, we decide whether
t1 is more accurate than t2 in the absence of the true value of e.
It is also based on integrity constraints as data quality rules.

5.1. Central Issues of Data Quality


We begin with an overview of central technical issues in connection with data quality. We then present current approaches
to tackling these issues. We invite the interested reader to consult [33] for a recent survey on the subject.

Information completeness concerns whether our database has


complete information to answer our queries. Given a database
D and a query Q, we want to know whether the complete answer to Q can be found by using only the data in D. As shown
in Example 2, when D does not include complete information
for a query, the answer to the query may not be correct.

Data quality problems. Data in the real world is often dirty.


It is common to find real-life data inconsistent, inaccurate, incomplete, out of date and duplicated. The error rate of business data is approximately 1%–5%, and for some companies it is above 30% [83]. In most data warehouse projects, data cleaning accounts for 30%–80% of the development time and budget [87], for improving the quality of the data rather than for
developing the systems. When it comes to incomplete information, it is estimated that pieces of information perceived as
being needed for clinical decisions were missing from 13.6%
to 81% of the time [76]. When data currency is concerned, it
is known that 2% of records in a customer file become obsolete in one month [26]. That is, in a database of 500 000 customer records, 10 000 records may go stale per month, 120 000
records per year, and within two years about 50% of all the
records may be obsolete. As remarked earlier, the scale of the
data quality problem is far worse in the context of big data.

Information completeness has been a longstanding problem.


A theory of relative information completeness has recently been
proposed [32], to decide whether our database has complete information to answer our queries, and if not, how we can expand
the database and make it complete, by including more data.
Data currency is also known as timeliness. It aims to identify
the current values of entities, and to answer queries with the
current values, in the absence of valid timestamps.
For example, recall the dataset D0 from Figure 1. Suppose
that we know that tuples t1 , t2 and t3 refer to the same person
Mary. Note that these tuples have two distinct values for salary:
50k and 80k, one is current and the other is stale. We want to
decide which one is current, when their timestamps are missing.
A data currency theory has recently been proposed in [40], to
deduce data currency when temporal information is only partly
known or not available at all. It is based on data quality rules
defined in terms of temporal constraints. For instance, we can
specify a rule asserting that the salary of each employee in a
company does not decrease, as commonly found in the real
world. Then we can deduce that Mary's current salary is 80k.
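Under this particular currency rule, the deduction amounts to taking the maximum of the observed salary values; the snippet below is a deliberately tiny illustration, whereas the framework of [40] handles general temporal constraints.

    def current_salary(observed_values):
        # "Salary never decreases" orders the untimestamped values, so the
        # current salary must be the largest value observed for the entity.
        return max(observed_values)

    current_salary([50_000, 80_000])   # 80000, matching the deduction for Mary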

Why do we care about dirty data? As shown in Example 2,


we may not get correct query answers if our data is dirty. As a
result, dirty data routinely leads to misleading analytical results
and biased decisions, and accounts for loss of revenues, credibility and customers. For example, it is reported that dirty data
cost US businesses 600 billion dollars every year [26].
Below we highlight five central issues of data quality.

Data deduplication aims to identify tuples in one or more relations that refer to the same real-world entity. It is also known as
entity resolution, duplicate detection, record matching, record
linkage, merge-purge, database hardening, and object identification (for data with complex structures such as graphs).

Data consistency refers to the validity and integrity of data representing real-world entities. It aims to detect inconsistencies
or conflicts in the data. For instance, tuple t1 of Figure 1 is
inconsistent: its area code is 20 while its city is Beijing.

For example, consider tuples t1 , t2 and t3 in Figure 1. To answer query Q0 of Example 1, we want to know whether these
tuples refer to the same employee Mary. The answer is affirmative if, e.g., there exists another relation which indicates that
Mary Smith and Mary Webber have the same email account.

Inconsistencies are identified as violations of data dependencies (a.k.a. integrity constraints [1]). Errors in a single relation can be detected by intrarelation constraints such as conditional functional dependencies (CFDs) [34], while errors across
different relations can be identified by interrelation constraints
such as conditional inclusion dependencies (CINDs) [75]. An
example CFD for the data of Figure 1 is: city = Beijing → AC = 10, asserting that for any tuple t, if t[city] = Beijing,

The need for studying data deduplication is evident in data


cleaning, data fusion and payment card fraud detection, among
other things. No matter how important it is, data deduplication

we want to decide whether D has complete information to answer an input query Q, among other things.

is nontrivial. Tuples pertaining to the same object may have


different representations in various data sources. Moreover, the
data sources may contain errors. These make it hard, if not impossible, to match a pair of tuples by simply checking whether
their attributes pairwise equal. Worse still, it is often too costly
to compare and examine every pair of tuples from big data.

For a centralized database D, given a set Σ of CFDs and CINDs, a fixed number of SQL queries can be automatically generated such that, when evaluated against D, the queries return all and only those tuples in D that violate Σ [33]. That is, we can effectively detect inconsistencies by leveraging the existing facility of commercial relational database systems.
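For instance, for the constant CFD city = Beijing → AC = 10 discussed above, one such detection query can be written directly; the table name R and the attribute names below follow Figure 1 and are illustrative, not the output of the query generation method of [33].

    # Detection query for the CFD  city = "Beijing" -> AC = "10":
    # it returns exactly the tuples that violate the rule.
    VIOLATION_SQL = """
        SELECT *
        FROM   R t
        WHERE  t.city = 'Beijing' AND t.AC <> '10'
    """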

Data deduplication is perhaps the most extensively studied


topic of data quality. A variety of approaches have been proposed (see [63] for a survey). In particular, a class of dynamic
constraints has been studied for data deduplication, known as
matching dependencies (MDs), as data quality rules [31].
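A toy rule in the spirit of MDs is sketched below; the attribute names, similarity function and threshold are illustrative assumptions, not the full semantics of [31].

    import difflib

    def same_entity(t1, t2):
        """If two records agree on email and have very similar last names,
        identify them as referring to the same person (a toy MD-style rule)."""
        similar = difflib.SequenceMatcher(None, t1["LN"], t2["LN"]).ratio() >= 0.8
        return t1["email"] == t2["email"] and similar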

Data repairing. After the errors are detected, we want to automatically localize and fix them. We also need
to identify tuples that refer to the same entity, and for each entity, determine its latest and most accurate values from the data
in our database. When some data is missing, we need to decide
what data we should import and where to import it from, so that
we will have sufficient information for tasks at hand.

Improving data quality. We have seen that real-life data is


often dirty, and dirty data is costly. In light of these, effective
techniques have to be in place to improve data quality. To do
this, a central question concerns how we can tell whether our
data is dirty or clean. To this end, we need data quality rules to
detect semantic errors in our data and fix those errors. A number of dependency (constraint) formalisms have been proposed
as data quality rules, and are being used in industry, e.g., CFDs,
CINDs and MDs. Below we briefly describe the basic functionality of a rule-based system for data quality management.

This highlights the need for data repairing [5]. Given a set Σ of dependencies and an instance D of a database schema R, it is to find a candidate repair of D, i.e., another instance D′ of R such that D′ satisfies Σ and D′ minimally differs from the original database D. The data repairing problem is, nevertheless, highly nontrivial: it is NP-complete even when a fixed set of traditional functional dependencies (FDs) or a fixed set of inclusion dependencies (INDs) is used as data quality rules [14]. In light of this, several heuristic algorithms have been developed to effectively repair data by employing FDs and INDs [14], CFDs [20, 96], or CFDs and MDs [46] as data quality rules.
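To give a flavor of heuristic repairing (far simpler than the cost-based algorithms of [14, 20, 46]): for a violated FD X → Y, one crude fix is to overwrite the Y-attribute within each X-group by its majority value, as sketched below for tuples represented as dictionaries.

    from collections import Counter, defaultdict

    def repair_fd(tuples, lhs, rhs):
        """Crude repair for an FD lhs -> rhs: in each group of tuples agreeing
        on lhs, overwrite rhs with the most frequent value in that group."""
        groups = defaultdict(list)
        for t in tuples:
            groups[tuple(t[a] for a in lhs)].append(t)
        for group in groups.values():
            majority = Counter(t[rhs] for t in group).most_common(1)[0][0]
            for t in group:
                t[rhs] = majority
        return tuples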

Discovering data quality rules. To use dependencies as data


quality rules, it is necessary to have efficient techniques in place
that can automatically discover dependencies from data. Indeed, it is unrealistic to just rely on human experts to design
data quality rules via an expensive and long manual process, or
count on business rules that have been accumulated. This suggests that we learn informative and interesting data quality rules
from (possibly dirty) data, and prune away insignificant rules.

The data repairing methods mentioned above are essentially


heuristic: while they improve the overall quality, they do not
guarantee to find correct fixes for each error detected, i.e., they
do not warrant a precision and recall of 100%. Worse still, they
may introduce new errors when trying to repair the data. Hence,
they are not accurate enough to repair critical data such as clinical data, in which a minor error may have disastrous consequences. This highlights the quest for effective methods to find
certain fixes that are guaranteed correct. Such a method has
been developed in [43]. It guarantees that whenever it updates
data, it correctly fixes an error without introducing new errors.

More specifically, given a database D, the discovery problem is to find a minimal cover of all dependencies (e.g., CFDs,
CINDs, MDs) that hold on D, i.e., a non-redundant set of dependencies that is logically equivalent to the set of all dependencies that hold on D. Several algorithms have been developed for discovering CFDs and MDs (e.g., [18, 35, 55]).
Validating data quality rules. A given set Σ of dependencies, either automatically discovered or manually designed by domain experts, may be dirty itself. In light of this, we have to identify consistent dependencies from Σ, i.e., those rules that make sense, to be used as data quality rules. Moreover, we need to remove redundancies from Σ via the implication analysis of the dependencies, to speed up the data cleaning process.

The rule discovery, rule validation, error detection and data


repairing methods mentioned above have been supported by
commercial systems and have proven effective in industry.
5.2. New Challenges Introduced by Big Data
Previous work on data quality has mostly focused on relational data residing in a centralized database. To improve
the quality of big data and hence, get sensible answers to our
queries in big data, new techniques have to be developed.

This problem is nontrivial. It is NP-complete to decide whether a given set of CFDs is satisfiable [34]. Nevertheless, there has been an approximation algorithm for extracting a set Σ′ of consistent rules from a set Σ of possibly inconsistent CFDs, while guaranteeing that Σ′ is within a constant bound of the maximum consistent subset of Σ (see [34] for details).

Repairing distributed data. Big data is often distributed. In


the distributed setting, all the data quality issues mentioned
above become more challenging. For example, consider error
detection. As remarked earlier, this is simple in a centralized
database system: SQL queries can be automatically generated
so that we can execute them against our database and catch all
inconsistencies and conflicts. In contrast, this is more intriguing
in distributed data: it necessarily requires us to ship data from

Detecting errors. After a validated set Σ of data quality rules is identified, the next question concerns how to effectively catch errors in a database by using these rules. Given a set Σ of consistent data quality rules and a database D, we want to detect inconsistencies in D, i.e., to find all tuples in D that violate some rule in Σ. When it comes to relative information completeness,

one site to another. In this setting, error detection with minimum data shipment or minimum response time becomes NP-complete [37], and the SQL-based techniques no longer work.
For distributed data, effective batch algorithms [37] and incremental algorithms [44] have been developed for detecting
errors, with certain performance guarantees. However, rule discovery and data repairing algorithms remain to be developed
for distributed data. These are highly challenging. For instance,
data repairing for centralized databases is already NP-complete
even when a fixed set of FDs is taken as data quality rules [14],
i.e., when only the size |D| of datasets is concerned (a.k.a. data
complexity [1]). When D is of PB size and D is distributed, its
computational and communication costs are prohibitive.

Coupling with knowledge bases. A large part of big data


comes from Web sources or social networks. To improve the
quality of such data, we ultimately have to use knowledge bases
and ontology. A number of knowledge bases are being developed, such as Knowledge Graph [57], Yago [95], and Wiki [94].
However, the quality of these knowledge bases needs to be improved themselves. This suggests that we study the following. How to detect inconsistencies and conflicts in a knowledge
base? How to repair a knowledge base? How to make use of
available knowledge bases to clean data from the Web?
6. Conclusion
We have reported an account of recent work of the International Research Center on Big Data at Beihang University, on
querying big data. Our main conclusion is as follows.

Deducing the true values of entities. To answer a query in


big data, we may have to use data from tens of thousands of sources [22]. With this comes the need for data fusion and conflict resolution [13]. That is, for each entity e, we need to identify the set De of data items that refer to the same e from those
sources, and moreover, deduce the true value of e from De .

Query answering in big data is radically different from


what we know about querying traditional databases.
We need to revise complexity theory and approximation
theory to characterize what we can do and what is impossible for computing exact or approximate query answers.

Example 8. Recall Figure 1. Suppose that t1 , t2 and t3 come


from different sources. We need data deduplication methods to
determine whether they refer to the same person Mary. If so, we
want to find the true values of Mary. To do this, we may need
to, e.g., reason about both data currency and consistency. As
an example, for attribute LN (last name), Mary has two conflict
values: Smith and Webber. We want to know what is the latest
and correct value. To this end, we know that marital status can
only change from single to married, and that her last name and
marital status are correlated. From these we can deduce that
the true value of LN of Mary is Webber.
As another example, suppose that s1 and s2 of Figure 1 refer
to the same person. To deduce the true value of his FN (first
name), we may use a CFD: FN = Bob → FN = Robert. This rule for data consistency allows us to normalize the FN attribute and change the nickname Bob to Robert.


Querying big data is challenging, but doable. It calls for a


set of new effective query processing techniques.
Big data = quantity + quality. These are the two sides of
the same coin, and neither works well when taken alone.
Summing up, we believe that the need for studying query answering in big data cannot be overstated, and that the subject is
a rich source of questions and vitality. We reiterate our invitation to interested colleagues to join us in the study.
Acknowledgments. Fan and Huai are supported in part by 973
Program 2014CB340302. Fan is also supported in part by NSFC
61133002, 973 Program 2012CB316200, Guangdong Innovative
Research Team Program 2011D005 and Shenzhen Peacock Program 1105100030834361, China, EPSRC EP/J015377/1, UK, and
NSF III 1302212, US.

From the example we can see that to deduce the true values of an entity, we need to combine several techniques: data
deduplication, data consistency and data currency, among other
things. This can be done in a uniform logical framework based
on data quality rules. There has been recent preliminary work
on the topic [39]. Nonetheless, there is much more to be done.

References
[1] S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, 1995.
[2] S. Acharya, P. B. Gibbons, V. Poosala, and S. Ramaswamy. Join synopses
for approximate query answering. In SIGMOD, pages 275286, 1999.
[3] F. N. Afrati and J. D. Ullman. Optimizing joins in a map-reduce environment. In EDBT, pages 99110, 2010.
[4] S. Agarwal, B. Mozafari, A. Panda, H. Milner, S. Madden, and I. Stoica.
BlinkDB: queries with bounded errors and bounded response times on
very large data. In EuroSys, pages 2942, 2013.
[5] M. Arenas, L. Bertossi, and J. Chomicki. Consistent query answers in
inconsistent databases. In PODS, pages 6879, 1999.
[6] M. Armbrust, K. Curtis, T. Kraska, A. Fox, M. J. Franklin, and D. A.
Patterson. PIQL: Success-tolerant query processing in the cloud. PVLDB,
5(3):181192, 2011.
[7] M. Armbrust, A. Fox, D. A. Patterson, N. Lanham, B. Trushkowsky,
J. Trutna, and H. Oh. SCADS: Scale-independent storage for social computing applications. In CIDR, 2009.
[8] M. Armbrust, E. Liang, T. Kraska, A. Fox, M. J. Franklin, and D. Patterson. Generalized scale independence through incremental precomputation. In SIGMOD, pages 625636, 2013.

Cleaning data with complex structures. Data quality techniques have been mostly studied for structured data with a regular structure and a schema, such as relational data. When it
comes to big data, however, data typically has an irregular structure and does not have a schema. For example, an entity may be
represented as a subgraph in a large graph, such as a person in a
social graph. In this context, all the central issues of data quality
have to be revisited. These are far more challenging than their
counterparts for relational data, and effective techniques are not
yet in place. Consider data deduplication, for instance. Given
two graphs (without a schema), we want to determine whether
they represent the same object. To do this, we need to extend
data quality rules from relations to graphs.

[9] B. Babcock, S. Chaudhuri, and G. Das. Dynamic sample selection for


approximate query processing. In SIGMOD, pages 539550, 2003.
[10] Beihang University. International Research Center at Big Data.
http://rcbd.buaa.edu.cn/en/index.html.
[11] M. Bendersky, D. Metzler, and W. Croft. Learning concept importance
using a weighted dependence model. In WSDM, pages 3140, 2010.
[12] M. Bienvenu, B. ten Cate, C. Lutz, and F. Wolter. Ontology-based data
access: a study through disjunctive datalog, CSP, and MMSNP. In PODS,
pages 213224, 2013.
[13] J. Bleiholder and F. Naumann. Data fusion. ACM Comput. Surv., 41(1),
2008.
[14] P. Bohannon, W. Fan, M. Flaster, and R. Rastogi. A cost-based model
and effective heuristic for repairing constraints by value modification. In
SIGMOD, pages 143154, 2005.
[15] J. Brynielsson, J. Hogberg, L. Kaati, C. Martenson, and P. Svenson. Detecting social positions using simulation. In ASONAM, pages 4855,
2010.
[16] Y. Cao, W. Fan, and W. Yu. Determining the relative accuracy of attributes. In SIGMOD, pages 565576, 2013.
[17] Y. Cao, W. Fan, and W. Yu. Bounded conjunctive queries. PVLDB, pages
1231 1242, 2014.
[18] F. Chiang and R. J. Miller. Discovering data quality rules. PVLDB,
1(1):11661177, 2008.
[19] J. Cho, N. Shivakumar, and H. Garcia-Molina. Finding replicated Web
collections. SIGMOD Rec., 29(2):355366, 2000.
[20] G. Cong, W. Fan, F. Geerts, X. Jia, and S. Ma. Improving data quality:
Consistency and accuracy. In VLDB, pages 315326, 2007.
[21] P. Crescenzi, V. Kann, and M. Halldorsson. A compendium of NP optimization problems.
http://www.nada.kth.se/~viggo/wwwcompendium/.
[22] N. N. Dalvi, A. Machanavajjhala, and B. Pang. An analysis of structured
data on the Web. PVLDB, 5(7):680691, 2012.
[23] J. Dean and S. Ghemawat. MapReduce: simplified data processing on
large clusters. Commun. ACM, 51(1):107113, 2008.
[24] T. Deng and W. Fan. On the complexity of query result diversification.
PVLDB, 6(8):577588, 2013.
[25] R. Dorrigiv, A. Lopez-Ortiz, and A. Salinger. Optimal speedup on a lowdegree multi-core parallel architecture (LoPRAM). In SPAA, pages 185
187, 2008.
[26] W. W. Eckerson. Data quality and the bottom line: Achieving business
success through a commitment to high quality data. Technical report, The
Data Warehousing Institute, 2002.
[27] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm
for discovering clusters in large spatial databases with noise. In KDD,
pages 226231, 1996.
[28] Facebook. http://newsroom.fb.com.
[29] Facebook. Introducing Graph Search.
https://en-gb.facebook.com/about/graphsearch.
[30] R. Fagin, A. Lotem, and M. Naor. Optimal aggregation algorithms for
middleware. JCSS, 66(4):614656, 2003.
[31] W. Fan, H. Gao, X. Jia, J. Li, and S. Ma. Dynamic constraints for record
matching. VLDB J., 20(4):495520, 2011.
[32] W. Fan and F. Geerts. Relative information completeness. ACM Trans.
on Database Systems, 35(4), 2010.
[33] W. Fan and F. Geerts. Foundations of Data Quality Management. Morgan
& Claypool Publishers, 2012.
[34] W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis. Conditional functional dependencies for capturing data inconsistencies. ACM Trans. on
Database Systems, 33(1), 2008.
[35] W. Fan, F. Geerts, J. Li, and M. Xiong. Discovering conditional functional
dependencies. TKDE, 23(5):683698, 2011.
[36] W. Fan, F. Geerts, and L. Libkin. On scale independence for querying big
data. In PODS, pages 5162, 2014.
[37] W. Fan, F. Geerts, S. Ma, and H. Muller. Detecting inconsistencies in
distributed data. In ICDE, pages 6475, 2010.
[38] W. Fan, F. Geerts, and F. Neven. Making queries tractable on big data
with preprocessing. PVLDB, 6(8):577588, 2013.
[39] W. Fan, F. Geerts, N. Tang, and W. Yu. Inferring data currency and consistency for conflict resolution. In ICDE, pages 470481, 2013.
[40] W. Fan, F. Geerts, and J. Wijsen. Determining the currency of data. ACM
Trans. on Database Systems, 37(4), 2012.
[41] W. Fan, J. Li, S. Ma, N. Tang, and Y. Wu. Adding regular expressions to

graph reachability and pattern queries. In ICDE, pages 3950, 2011.


[42] W. Fan, J. Li, S. Ma, N. Tang, Y. Wu, and Y. Wu. Graph pattern matching:
From intractability to polynomial time. PVLDB, 3(1):11611172, 2010.
[43] W. Fan, J. Li, S. Ma, N. Tang, and W. Yu. Towards certain fixes with
editing rules and master data. VLDB J., 21(2):213238, 2012.
[44] W. Fan, J. Li, N. Tang, and W. Yu. Incremental detection of inconsistencies in distributed data. TKDE, 2014.
[45] W. Fan, J. Li, X. Wang, and Y. Wu. Query preserving graph compression.
In SIGMOD, pages 157168, 2012.
[46] W. Fan, S. Ma, N. Tang, and W. Yu. Interaction between record matching
and data repairing. ACM J. of Data and Information Quality, 2014.
[47] W. Fan, X. Wang, and Y. Wu. Performance guarantees for distributed
reachability queries. PVLDB, 5(11):13041315, 2012.
[48] W. Fan, X. Wang, and Y. Wu. Diversified top-k graph pattern matching.
PVLDB, 6(13):15101521, 2013.
[49] W. Fan, X. Wang, and Y. Wu. Incremental graph pattern matching. ACM
Trans. on Database Systems, 38(3), 2013.
[50] W. Fan, X. Wang, and Y. Wu. Answering graph pattern queries using
views. In ICDE, pages 184195, 2014.
[51] W. Fan, X. Wang, and Y. Wu. Distributed graph simulation: Impossibility
and possibility. PVLDB, pages 1083 1094, 2014.
[52] W. Fan, X. Wang, and Y. Wu. Querying big graphs within bounded resources. In SIGMOD, pages 301312, 2014.
[53] M. N. Garofalakis and P. B. Gibbons. Wavelet synopses with error guarantees. In SIGMOD, pages 476487, 2004.
[54] P. B. Gibbons and Y. Matias. Synopsis data structures for massive data
sets. In SODA, pages 909910, 1999.
[55] L. Golab, H. Karloff, F. Korn, D. Srivastava, and B. Yu. On generating
near-optimal tableaux for conditional functional dependencies. PVLDB,
1(1):376390, 2008.
[56] S. Gollapudi and A. Sharma. An axiomatic approach for result diversification. In WWW, pages 381390, 2009.
[57] Google. Knowledge Graph.
http://www.google.co.uk/insidesearch/features/search/knowledge.html.
[58] R. Greenlaw, H. J. Hoover, and W. L. Ruzzo. Limits to Parallel Computation: P-Completeness Theory. Oxford University Press, 1995.
[59] A. Y. Halevy. Answering queries using views: A survey. VLDB J.,
10(4):270294, 2001.
[60] J. Hartmanis and R. E. Stearns. On the computational complexity of algorithms. Trans. American Mathematical Society, 117:285306, May 1965.
[61] J. M. Hellerstein. The declarative imperative: Experiences and conjectures in distributed logic. SIGMOD Record, 39(1):519, 2010.
[62] M. R. Henzinger, T. Henzinger, and P. Kopke. Computing simulations on
finite and infinite graphs. In FOCS, pages 453462, 1995.
[63] T. N. Herzog, F. J. Scheuren, and W. E. Winkler. Data Quality and Record
Linkage Techniques. Springer, 2009.
[64] IBM. IBM big data platform.
http://www-01.ibm.com/software/data/bigdata/.
[65] Y. E. Ioannidis and V. Poosala. Histogram-based approximation of setvalued query-answers. In VLDB, pages 174185, 1999.
[66] H. V. Jagadish, N. Koudas, S. Muthukrishnan, V. Poosala, K. C. Sevcik,
and T. Suel. Optimal histograms with quality guarantees. In VLDB, pages
275286, 2009.
[67] D. S. Johnson. A catalog of complexity classes. In Handbook of Theoretical Computer Science, Volume A: Algorithms and Complexity (A). The
MIT Press, 1990.
[68] N. D. Jones. An introduction to partial evaluation. ACM Comput. Surv.,
28(3):480503, 1996.
[69] H. J. Karloff, S. Suri, and S. Vassilvitskii. A model of computation for
MapReduce. In SODA, pages 938948, 2010.
[70] L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley, 1990.
[71] P. Koutris and D. Suciu. Parallel evaluation of conjunctive queries. In
PODS, pages 223234, 2011.
[72] C. P. Kruskal, L. Rudolph, and M. Snir. A complexity theory of efficient
parallel algorithms. TCS, 71(1):95132, 1990.
[73] M. Lenzerini. Data integration: A theoretical perspective. In PODS,
pages 233246, 2002.
[74] S. Ma, Y. Cao, W. Fan, J. Huai, and T. Wo. Strong simulation: Capturing
topology in graph pattern matching. ACM Trans. on Database Systems,
39(1), 2014.


[75] S. Ma, W. Fan, and L. Bravo. Extending inclusion dependencies with


conditions. TCS, pages 64–95, 1998.
[76] D. W. Miller Jr., J. D. Yeast, and R. L. Evans. Missing prenatal records
at a birth center: A communication problem quantified. In AMIA Annu
Symp Proc., pages 535539, 2005.
[77] R. Milner. Communication and Concurrency. Prentice Hall, 1989.
[78] M. Morris, J. Teevan, and K. Panovich. What do people ask their social
networks, and why? A survey study of status message Q&A behavior. In
CHI, pages 17391748, 2010.
[79] L. D. Nardo, F. Ranzato, and F. Tapparo. The subgraph similarity problem. TKDE, 21(5):748749, 2009.
[80] A. Ntoulas, J. Cho, and C. Olston. What's new on the Web? The evolution of the Web from a search engine perspective. In WWW, pages 1–12,
2004.
[81] C. H. Papadimitriou. Computational Complexity. Addison-Wesley, 1994.
[82] G. Ramalingam and T. Reps. On the computational complexity of dynamic graph problems. TCS, 158(1-2):213224, 1996.
[83] T. Redman. The impact of poor data quality on the typical enterprise.
Commun. ACM, 2:7982, 1998.
[84] P. Rosch and W. Lehner. Sample synopses for approximate answering of
group-by queries. In EDBT, pages 403414, 2009.
[85] G. Santos. SSD ranking: The fastest solid state drives. http://www.fastestssd.com/featured/ssd-rankings-the-fastest-solidstate-drives/#pcie, Oct. 2012.

[86] T. K. Sellis. Personalization in web search and data management. In


Model and Data Engineering, pages 11. Springer, 2011.
[87] C. C. Shilakes and J. Tylman. Enterprise information portals. Technical
report, Merrill Lynch, Inc., New York, NY, Nov. 1998.
[88] D. Suciu and V. Tannen. A query language for NC. J. Comput. Syst. Sci.,
55(2):299321, 1997.
[89] Y. Tao, W. Lin, and X. Xiao. Minimal MapReduce algorithms. SIGMOD,
pages 529540, 2013.
[90] V. V. Vazirani. Approximation Algorithms. Springer, 2003.
[91] J. S. Vitter and M. Wang. Approximate computation of multidimensional
aggregates of sparse data using wavelets. In SIGMOD, pages 193204,
1999.
[92] Wikipedia. Big data.
http://en.wikipedia.org/wiki/Big_data#cite_note-23.
[93] Wikipedia. F-measure.
http://en.wikipedia.org/wiki/Precision_and_recall.
[94] Wikipedia. Wiki.
http://en.wikipedia.org/wiki/Wiki.
[95] Wikipedia. Yago.
http://en.wikipedia.org/wiki/YAGO_(database).
[96] M. Yakout, A. K. Elmagarmid, J. Neville, M. Ouzzani, and I. F. Ilyas.
Guided data repair. PVLDB, 4(5):279289, 2011.

[97] L. Zou, L. Chen, and M. T. Özsu. Distance-join: Pattern match query in a large graph database. PVLDB, pages 886–897, 2009.

