Schema Inference For Massive JSON Datasets

ABSTRACT
In recent years, JSON has emerged as a very popular data format for representing massive data collections. JSON data collections are usually schemaless. While this ensures several advantages, the absence of schema information has important negative consequences: the correctness of complex queries and programs cannot be statically checked, users cannot rely on schema information to quickly figure out the structural properties that could speed up the formulation of correct queries, and many schema-based optimizations are not possible.

In this paper we deal with the problem of inferring a schema from massive JSON datasets. We first identify a JSON type language which is simple and, at the same time, expressive enough to capture irregularities and to give complete structural information about input data. We then present our main contribution, which is the design of a schema inference algorithm, its theoretical study, and its implementation based on Spark, enabling reasonable schema inference time for massive collections. Finally, we report on an experimental analysis showing the effectiveness of our approach in terms of execution time, precision and conciseness of inferred schemas, and scalability.

CCS Concepts
• Information systems → Semi-structured data; Data model extensions; • Theory of computation → Type theory; Logic;

Keywords
JSON, schema inference, map-reduce, Spark, big data collections

© 2017, Copyright is with the authors. Published in Proc. 20th International Conference on Extending Database Technology (EDBT), March 21-24, 2017, Venice, Italy: ISBN 978-3-89318-073-8, on OpenProceedings.org. Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.

1. INTRODUCTION
Big Data applications typically process and analyze very large structured and semi-structured datasets. In many of these applications, and in those relying on NoSQL document stores in particular, data are represented in JSON (JavaScript Object Notation) [10], a data format that is widely used thanks to its flexibility and simplicity.

JSON data collections are usually schemaless. This ensures several advantages: in particular, it enables applications to quickly consume huge amounts of semi-structured data without waiting for a schema to be specified. Unfortunately, the lack of a schema makes it impossible to statically detect unexpected or unwanted behaviours of complex queries and programs (i.e., lack of correctness), users cannot rely on schema information to quickly figure out structural properties that could speed up the formulation of correct queries, and many schema-based optimizations are not possible.

In this paper we deal with the problem of inferring a schema from massive JSON datasets. Our main goal in this work is to infer structural properties of JSON data, that is, a description of the structure of JSON objects and arrays that takes into account nested values and the presence of optional values. These are the main properties that characterize semi-structured data, and having a tool that ensures fast, precise, and concise inference is crucial in modern applications characterized by agile consumption of huge amounts of data coming from multiple and disparate sources.

The approach we propose here is based on a JSON schema language able to capture structural irregularities and complete structural information about input data. This language resembles and borrows mechanisms from existing proposals [20], but it has the advantage of being simple yet very expressive.

The proposed technique infers a schema that provides a global description of the whole input JSON dataset, while having a size that is small enough to enable a user to consult it in a reasonable amount of time, in order to get a global knowledge of the structural and type properties of the JSON collection. The description of the input JSON collection is global in the sense that each path that can be traversed in the tree-structure of each input JSON value can be traversed in the inferred schema as well. This property is crucial to enable a series of query optimization tasks. For instance,
{ "A": 123,
  "B": "The ...",
  "C": false,
  "D": ["abc", "cde", "fr12"] }

Figure 1: A JSON record r1.

Type inference.
Type inference, during the Map phase, is dedicated to inferring individual types for the input JSON values, and yields a set of distinct types to be fused during the Reduce phase. Some proposals of JSON schemas exist in the literature. With one exception [20], none of them uses regular expressions which, as we shall illustrate, are important for concisely representing types for array values. Moreover, a clean formal semantics specification of types is often missing in these works, hence making it difficult to understand their precise meaning.

The type language we adopt is meant to capture the core features of the JSON data model with an emphasis on succinctness. Intuitively, basic values are captured using standard data types (i.e., String, Number, Boolean), complex values are captured by introducing record and array type constructors, and a union type constructor is used to add flexibility and expressive power. To illustrate the type language, observe the following type that is inferred for the record r1 given in Figure 1:

{A : Num, B : Str, C : Bool, D : [Str, Str, Str]}

As we will show, the initial type inference is a quite simple and fast operation: it consists of a simple traversal of the input values that produces a type that is isomorphic to the value itself.

Type fusion.
Type fusion is the second step of our approach and consists in iteratively merging the types produced during the Map phase. Because it is performed during the Reduce phase in a distributed fashion, type fusion relies on a fusion operator which enjoys the commutativity and associativity properties. This fusion operator is invoked over two types T1 and T2, and produces a supertype of the inputs. To do so, the fusion collapses the parts of T1 and T2 that are identical and preserves the parts that are distinct in both types. To this end, T1 and T2 are processed in a synchronised top-down manner in order to identify common parts. The main idea is to represent only once what is common, and, at the same time, to preserve all the parts that differ.

Fusion treats atomic types, record types, and array types differently, as follows.

• Atomic types: the fusion of atomic types is obvious, as identical types are collapsed while different types are combined using the union operator.

• Record types: recall that valid record types enjoy key uniqueness. Therefore, the fusion of T1 and T2 is led by two rules:

(R1) matching keys from both types are collapsed and their respective types are recursively fused;

(R2) keys without a match are deemed optional in the resulting type and decorated with a question mark ?.

To illustrate those cases, assume that T1 and T2 are, respectively, {A:Str, B:Num} and {B:Bool, C:Str}. The only matching key is "B" and hence its two atomic types Num and Bool are fused, which yields Num + Bool. The other keys will be optional according to rule R2. Hence, fusion yields the type

T12 = {(A:Str)?, B:Num + Bool, (C:Str)?}

Assume now that T12 is fused with

T3 = {A:Null, B:Num}

Rules R1 and R2 need to be slightly adapted to deal with optional types. Intuitively, we should simply consider that optionality '?' prevails over the implicit total cardinality '1'. The resulting type is thus

T123 = {(A:Str + Null)?, B:Num + Bool, (C:Str)?}.

Fusion of nested records eventually associates keys with types that may be unions of atomic types, record types, and array types. We will see that, when such types are merged, we separately merge the atomic types, the record types, and the array types, and return the union of the result. For instance, the fusion of types

{l:(Bool + Str + {A:Num})}

and

{l:(Bool + {A:Str, B:Num})}

yields

{l:(Bool + Str + {A:(Num + Str), (B:Num)?})}.

• Array types: array fusion deserves special attention. A particular aspect to consider is that an array type obtained in the first phase may contain several repeated types, and may feature mixed-content. To deal with this, before fusing types we perform a kind of simplification on bodies by using regular expression types, and, in particular, union + and repetition ∗. To illustrate this point, consider the array value

["abc", "cde", {"E": "fr", "F": 12}],

containing two strings followed by a record (mixed-content). The first phase infers for this value the type

[Str, Str, {E:Str, F:Num}].

This type can actually be simplified. For instance, one can think of a partition-based approach which collapses adjacent identical types into a star-guarded type, thus transforming

[Str, Str, {E:Str, F:Num}]

into

[(Str)∗, {E:Str, F:Num}]

by collapsing the string types. The resulting schema is indeed succinct and precise. However, succinctness cannot be guaranteed after fusion. For instance, if that type was to be merged with

[{E:Str, F:Num}, Str, Str],
where strings and record swapped positions, succinctness would be lost because we need to duplicate at least one sub-expression, (Str)∗ or {E:Str, F:Num}. As we are mainly concerned with generating types that can be human-readable, we trade some precision for succinctness and do not account for position anymore. To achieve this, in our simplification process (made before fusing array types) we generalize the above partition-based solution by returning the star-guarded union of all distinct types expressed in an array. So, simplification of either

[Str, Str, {E:Str, F:Num}]

or

[{E:Str, F:Num}, Str, Str]

yields the same type

S = [(Str + {E:Str, F:Num})∗].

After the array types have been simplified in this manner, they are fused by simply recursively fusing their content types, applying the same technique described for record types: when the body type is a union type, we separately merge the atomic components, the array components, and the record components, and take the union of the results.

3. RELATED WORK
The problem of inferring structural information from JSON data collections has recently gained the attention of the database research community. The closest work to ours is the very preliminary investigation that we presented in [12]. While [12] only provides a sketch of a MapReduce approach for schema inference, in this paper we present the results of a much deeper study. In particular, while in [12] a declarative specification of only a few cases of the fusion process is presented, in this paper we fully detail this process and provide a formal specification as well as a fusion algorithm. Furthermore, differently from [12], we present here an experimental evaluation of our approach validating our claims of parallelizability and succinctness.

In [22] Wang et al. present a framework for efficiently managing a schema repository for JSON document stores. The proposed approach relies on a notion of JSON schema called skeleton. In a nutshell, a skeleton is a collection of trees describing structures that frequently appear in the objects of a JSON data collection. In particular, the skeleton may totally miss information about paths that can be traversed in some of the JSON objects. In contrast, our approach enables the creation of a complete yet succinct schema description of the input JSON dataset. As already said, having such a complete structural description is of vital importance for many tasks, like query optimisation, defining and enforcing access-control security policies, and, importantly, giving the user a global structural vision of the database that can help her in querying and exploring the data in an effective way. Another important application of complete schema information is query type checking: as illustrated in [12], our inferred schemas can be used to make type checking of Pig Latin scripts much stronger.

In a very recent work [20], motivated by the need of laying the formal foundations for the JSON Schema language [3], Pezoa et al. present the formal semantics of that language, as well as a theoretical study of its expressive power and validation problem. While that work does not deal with the schema inference problem, our schema language can be seen as a core part of the JSON Schema language studied therein, and shares union types and repetition types with that one. These constructors are at the basis of our technique to collapse several schemas into a more succinct one.

An alternative proposal for typing JSON data is JSound [2]. That language is quite restrictive wrt ours and JSON Schema: for instance, it lacks union types.

In a very recent work [13], DiScala and Abadi deal with the problem of automatically transforming denormalised, nested JSON data into normalised relational data that can be stored in an RDBMS; this is achieved by means of a schema generation algorithm that learns the normalised, relational schema from the data. Differently from that work, we deal with schemas that are far from being relational, and are closer to tree regular grammars [17]. Furthermore, the approach proposed in [13] ignores the original structure of the JSON input dataset and, instead, depends on patterns in the attribute data values (functional dependencies) to guide its schema generation. So, that approach is complementary to ours.

In [15] Liu et al. propose storage, querying, and indexing principles enabling RDBMSs to manage JSON. The paper does not deal with schema inference, but indicates a possible optimisation of their framework based on the identification of common attributes in JSON objects that can be captured by a relational schema for optimization purposes. In [21] Scherzinger et al. propose a plugin to track changes in object-NoSQL mappings. The technique is currently limited to only detecting mismatches between base types (e.g., Boolean, Integer, String), and the authors claim that a wider knowledge of schema information is needed to enable the detection of other kinds of changes, like, for instance, the removal or renaming of attributes.

It is important to state that the problem of schema inference has already been addressed in the past in the context of semi-structured and XML data models. In [18] and [19], Nestorov et al. describe an approach to extract a schema from semistructured data. They propose an object-oriented type system where nodes are captured by classes built starting from nodes sharing the same incoming and outgoing edges, and where data edges are generalized to relations between the classes. In [19], the problem of building a type out of a collection of semistructured documents is studied. The emphasis is put on minimizing the size of the resulting type while maximizing its precision. Although that work considers a very general data model captured by graphs, it does not suit our context. Firstly, we consider the JSON model, which is tree-shaped by nature and features specific constructs such as arrays that are not captured by the semi-structured data model. Secondly, we aim at processing potentially large datasets efficiently, a problem that is not directly addressed in [18] and [19].

More recent efforts on XML schema inference (see [14] and works cited therein) are also worth mentioning since they are somewhat related to our approach. The aim of these approaches is to infer restricted, yet expressive enough, forms of regular expressions starting from a positive set of strings representing element contexts of XML documents. While XML and JSON both allow one to represent tree-shaped data, they have radical differences that make existing XML related approaches difficult to apply to the JSON setting.
Similar remarks hold for related approaches to schema inference for RDF [11]. Furthermore, none of these approaches is designed to deal with massive datasets.

4. DATA MODEL AND TYPE LANGUAGE
This section is devoted to formalizing the JSON data model and the schema language we adopt.

We represent JSON values as records and arrays, whose abstract syntax is given in Figure 2. Basic values B comprise the null value, booleans, numbers n, and strings s. As outlined in Section 2, records are sets of fields, each field being an association of a value V to a key l, whereas arrays are sequences of values. The abstract syntax is practical for the formal treatment, but we will typically use the more readable notation introduced at the bottom of Figure 2, where records are represented as {l1 : V1, . . . , ln : Vn} and arrays are represented as [V1, . . . , Vn].

Syntax:
    V ::= B | R | A                       Top-level values
    B ::= null | true | false | n | s     Basic values
    R ::= ERec | Rec(l, V, R)             Records
    A ::= EArr | Arr(V, A)                Arrays

Semantics of records (domain: FS(Keys × Values)):
    ⟦ERec⟧ := ∅
    ⟦Rec(l, V, R)⟧ := {(l, V)} ∪ ⟦R⟧

Semantics of arrays (domain: Lists(Values)):
    ⟦EArr⟧ := [ ]
    ⟦Arr(V, A)⟧ := ⟦V⟧ :: ⟦A⟧

Notation:
    {l1 : V1, . . . , ln : Vn} := Rec(l1, V1, . . . Rec(ln, Vn, ERec))
    [V1, . . . , Vn] := Arr(V1, . . . Arr(Vn, EArr))

Figure 2: Syntax of JSON data.

In JSON, a record is well-formed only if all its top-level keys are mutually different. In the sequel, we only consider well-formed JSON records, and we use Keys(R) to denote the set of the top-level keys of R.

Since a record is a set of fields, we identify two records that only differ in the order of their fields.

The syntax of the JSON schema language we adopt is depicted in Figure 3. The core of this language is captured by the non-terminals BT, RT, and AT, which are a straightforward generalization of their B, R, and A counterparts from the data model syntax.

As previously illustrated in Section 2, we adopt a very specific form of regular types in order to prepare an array type for fusion. Before fusion, an array type [T1, . . . , Tn] is simplified as [(T1 + . . . + Tn)∗], or, more precisely, as [LFuse(T1, . . . , Tn)∗]: instead of giving the content type element by element as in [T1, . . . , Tn], we just say that it contains a sequence of values all belonging to LFuse(T1, . . . , Tn), which will be defined as a compact super-type of T1 + . . . + Tn. This simplification is allowed by the fact that, besides the basic array types AT = [T1, . . . , Tn], we also have the simplified array type SAT = [T∗], where T may be any type, including a union type.

A field OptRecT(l, T, . . .), represented as l : T? in the simplified notation, represents an optional field, that is, a field that may be either present or absent in a record of the corresponding type. For example, a type {l : Num?, m : (Str + Null)} describes records where l is optional and, if present, contains a number, while the m field is mandatory and may contain either null or a string.

A union type T + U contains the union of the values from T and those from U. The empty type denotes the empty set.¹

We now define schema semantics by means of the function ⟦·⟧, defined as the minimal function mapping types to sets of values that satisfies the following equations. For the sake of simplicity we omit the case of basic types.

Auxiliary functions:
    S^0 := {[ ]}
    S^{n+1} := {[V] :: a | V ∈ S, a ∈ S^n}
    S^∗ := ⋃_{i∈ℕ} S^i

Record types (domain: Sets(FS(Keys × Values))):
    ⟦ERecT⟧ := {∅}
    ⟦RecT(l, T, RT)⟧ := {{(l, V)} ∪ R | V ∈ ⟦T⟧, R ∈ ⟦RT⟧}
    ⟦OptRecT(l, T, RT)⟧ := ⟦RecT(l, T, RT)⟧ ∪ ⟦RT⟧

Array types and simplified array types (domain: Sets(Lists(Values))):
    ⟦EArrT⟧ := {[ ]}
    ⟦ArrT(T, AT)⟧ := {[V] :: A | V ∈ ⟦T⟧, A ∈ ⟦AT⟧}
    ⟦[T∗]⟧ := ⟦T⟧∗

Union types:
    ⟦T + U⟧ := ⟦T⟧ ∪ ⟦U⟧
    and the semantics of the empty type is the empty set ∅.

The basic idea behind our type fusion mechanism is that we always generalize the union of two record types to one record type containing the keys of both, and similarly for the union of two array types. We express this idea as 'merging types that have the same kind'. The following kind() function, which maps each type to an integer ranging over {0, . . . , 5}, is used to implement this approach.

    kind(Null) = 0    kind(Bool) = 1    kind(Num) = 2
    kind(Str) = 3     kind(RT) = 4      kind(AT) = kind(SAT) = 5

In the sequel, generic types are indicated by the metavariables T, U, W, while BT, RT, and AT are reserved for basic types, record types, and array types.

¹ The empty type is never used during type inference, since no value belongs to it. In greater detail, it is actually a technical device that is only useful when an empty array type EArrT is simplified, before fusion, into a simplified array type: EArrT (that is, the type [ ]) is simplified into a simplified array type whose body is the empty type, which has the same semantics as EArrT, and our algorithms never insert the empty type in any other position.
    T   ::= BT | RT | AT | SAT | (empty type) | T + T            Top-level types
    BT  ::= Null | Bool | Num | Str                              Basic types
    RT  ::= ERecT | RecT(l, T, RT) | OptRecT(l, T, RT)           Record types
    AT  ::= EArrT | ArrT(T, AT)                                  Array types
    SAT ::= [T∗]                                                 Simplified array types

Notation:
    {l1 : T1 [?], . . . , ln : Tn [?]} := [Opt]RecT(l1, T1, . . . [Opt]RecT(ln, Tn, ERecT))    ('?' is translated as 'Opt')
    [ ] := EArrT
    [T1, . . . , Tn] := ArrT(T1, . . . ArrT(Tn, EArrT))

Figure 3: Syntax of the JSON schema language.
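To make the grammar of Figure 3 concrete, the sketch below shows one possible Scala encoding of the type language as an algebraic data type, together with the kind() function defined above. The names (JType, Field, RecT, and so on) and the exact constructor shapes are our own illustration under these assumptions, not the paper's implementation.

```scala
// A possible Scala encoding of the type language of Figure 3 (illustrative names).
sealed trait JType
case object NullT  extends JType                                    // Null
case object BoolT  extends JType                                    // Bool
case object NumT   extends JType                                    // Num
case object StrT   extends JType                                    // Str
case object EmptyT extends JType                                    // the empty type
final case class UnionT(left: JType, right: JType) extends JType    // T + T

// A record type is a set of fields; an optional field carries the '?' cardinality.
final case class Field(label: String, tpe: JType, optional: Boolean)
final case class RecT(fields: List[Field]) extends JType

// Array types, element by element, and simplified array types [T*].
final case class ArrT(elems: List[JType]) extends JType
final case class SimplArrT(body: JType)   extends JType

// kind(): the six-valued classification used by fusion (Section 4).
def kind(t: JType): Int = t match {
  case NullT        => 0
  case BoolT        => 1
  case NumT         => 2
  case StrT         => 3
  case _: RecT      => 4
  case _: ArrT      => 5
  case _: SimplArrT => 5
  case EmptyT | _: UnionT =>
    sys.error("kind() is only applied to non-union, non-empty types")
}
```

With this encoding, the type {l : Num?, m : (Str + Null)} discussed above would be written RecT(List(Field("l", NumT, optional = true), Field("m", UnionT(StrT, NullT), optional = false))).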
Later on, in order to express the correctness of the fusion process, we rely on the usual notion of subtyping (type inclusion).

Definition 4.1 (Subtyping) Let T and U be two types. Then T is a subtype of U, denoted by T <: U, if and only if ⟦T⟧ ⊆ ⟦U⟧.

The subtyping relation is a partial order among types. We do not use any subtype checking algorithm in this work, but we exploit this notion to state properties of our schema inference approach.

5. SCHEMA INFERENCE
As already said, our approach is based on two steps: i) type inference for each single value in the input JSON data collection, and ii) fusion of the types generated by the first step. We present these steps in the following two sections.

5.1 Initial Schema Inference
The first step of our approach consists of a Map phase that performs schema inference for each single value of the input collection. Type inference for single values is done according to the inference rules in Figure 4. Each rule allows one to infer the type of a value indicated in the conclusion (the part below the line) in terms of types recursively determined in the premises (the part above the line). Rules with no premises deal with the terminal cases of the recursive typing process, which infers the type of a value by simply reflecting the structure of the value itself. Note the particular case of record values, where uniqueness of the attribute keys li is checked. Also notice that these rules are deterministic: each possible value matches at most the conclusion of one rule. These rules, hence, directly define a recursive typing algorithm. The following lemma states the soundness of value typing, and it can be proved by a simple induction.

Lemma 5.1 For any JSON value V, ⊢ V ; T implies V ∈ ⟦T⟧.

It is worth noticing that the schema inference done in this phase does not exploit the full expressivity of the schema language. Union types, optional fields, and repetition types (the Simplified Array Types) are never inferred; these types will only be produced by the schema fusion phase described next.

5.2 Schema Fusion
The second phase of our approach is meant to fuse all the types inferred in the first Map phase. The main mechanism of this phase is a binary fusion function that is commutative and associative. These properties are crucial as they ensure that the function can be iteratively applied over n types in a distributed and parallel fashion.

When fusion is applied over two types T and U, it outputs either a single type obtained by recursively merging T and U, if they have the same kind, or the simple union T + U otherwise. Since fusion may result in a union type, and since this is in turn fused with other types, possibly obtained by fusion itself, the fusion function has to deal with the case where union types T = T1 + . . . + Tn and U = U1 + . . . + Um need to be fused. In this case, our fusion function identifies and fuses types Tj and Uh with matching kinds, while types of non-matching kinds are just moved unchanged into the output union type. As we will see later, the fusion process ensures the invariant property that a given kind may occur at most once in each output union type; hence, in the two union types above, n ≤ 6 and m ≤ 6, since we only have six different kinds.

The auxiliary functions KMatch and KUnmatch, defined in Figure 5, respectively have the purpose of collecting pairs of types of the same kind in two union types T1 and T2, and of collecting the non-matching types. In Figure 5, two similar functions FMatch and FUnmatch are defined: they identify and collect fields having matching/unmatched keys in two input record-type bodies RT1 and RT2.

These functions are based on the auxiliary function ◦(T) and on a companion field-extraction function for record types. The function ◦(T) transforms a union type T1 + . . . + Tn into the multiset of its addends, i.e., the non-union types T1, . . . , Tn. The field-extraction function transforms a record type {(l1:T1)^m1, . . . , (ln:Tn)^mn} into the set of its fields; in this case we can use a set since no repetition of keys is possible. Here we use (l:T)^1 to denote a mandatory field, (l:T)^? to denote an optional field, and the symbols m and n as metavariables that range over {1, ?}.

We are now ready to present the fusion function. Its formal specification is given in Figure 6. We use the function ⊕(S), which is a right inverse of ◦(T) and rebuilds a union type from a multiset of non-union types, and a record-rebuilding function, which is a right inverse of the field-extraction function and rebuilds a record type from a set of fields. We also use min(m, n), a partial function that picks the "smallest" cardinality, by assuming ? < 1.

The general case, where two types T1 and T2 that may be union types have to be fused, is dealt with by the Fuse(T1, T2) function.
(TypeNull)  (TypeTrueBool)  (TypeNumber)  (TypeString)  (TypeEmptyRec)  (TypeEmptyArray): premise-free rules covering the terminal cases.

(TypeRec)
    ⊢ V ; T      ⊢ W ; RT      l ∉ Keys(RT)
    ----------------------------------------
        ⊢ Rec(l, V, W) ; RecT(l, T, RT)

(TypeArray)
    ⊢ V ; T      ⊢ W ; AT
    ----------------------------
    ⊢ Arr(V, W) ; ArrT(T, AT)

Figure 4: Type inference rules for JSON values.
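The following sketch illustrates the Map-phase typing of Section 5.1 as a single structural traversal, in the spirit of the rules of Figure 4. To keep it self-contained it uses a small, hand-rolled value ADT instead of the Json4s representation used in the paper's implementation; all names are illustrative.

```scala
// Minimal stand-ins for JSON values and for the inferred types (illustrative names).
sealed trait JValue
case object JNull                                     extends JValue
final case class JBool(b: Boolean)                    extends JValue
final case class JNum(n: BigDecimal)                  extends JValue
final case class JStr(s: String)                      extends JValue
final case class JArr(elems: List[JValue])            extends JValue
final case class JRec(fields: List[(String, JValue)]) extends JValue

sealed trait JType
case object NullT extends JType
case object BoolT extends JType
case object NumT  extends JType
case object StrT  extends JType
final case class Field(label: String, tpe: JType, optional: Boolean)
final case class RecT(fields: List[Field]) extends JType
final case class ArrT(elems: List[JType])  extends JType

// One traversal per value: the inferred type mirrors the structure of the value,
// and duplicate record keys are rejected, as required by rule (TypeRec).
def inferType(v: JValue): JType = v match {
  case JNull       => NullT
  case JBool(_)    => BoolT
  case JNum(_)     => NumT
  case JStr(_)     => StrT
  case JArr(elems) => ArrT(elems.map(inferType))
  case JRec(fields) =>
    val keys = fields.map(_._1)
    require(keys.distinct.size == keys.size, s"duplicate keys in record: $keys")
    RecT(fields.map { case (l, value) => Field(l, inferType(value), optional = false) })
}
```

Applied to the record r1 of Figure 1, such a traversal yields the element-by-element type {A : Num, B : Str, C : Bool, D : [Str, Str, Str]}; union types, optional fields, and simplified array types only appear later, during fusion.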
(From Figure 5.)

◦(T): transforms a type into a multiset of non-union types (∪ below denotes multiset union):
    ◦(T1 + T2) := ◦(T1) ∪ ◦(T2)
    ◦(T) := { }     when T is the empty type
    ◦(T) := {T}     when T is neither a union type nor the empty type

KMatch(T1, T2) := {(U1, U2) | U1 ∈ ◦(T1), U2 ∈ ◦(T2), kind(U1) = kind(U2)}

KUnmatch(T1, T2) := {U1 ∈ ◦(T1) | ∀U2 ∈ ◦(T2). kind(U1) ≠ kind(U2)}
                  ∪ {U2 ∈ ◦(T2) | ∀U1 ∈ ◦(T1). kind(U2) ≠ kind(U1)}
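A possible reading of KMatch and KUnmatch in code is given below. The sketch is generic: it works over the multiset of addends produced by ◦(·), represented as a plain list, for any type representation equipped with a kind function; it is an illustration of the definitions above, not the paper's code.

```scala
// Pair up addends of two (flattened) union types that have the same kind,
// and collect the addends whose kind has no counterpart on the other side.
def kMatch[T](ts: List[T], us: List[T])(kind: T => Int): List[(T, T)] =
  for { t <- ts; u <- us; if kind(t) == kind(u) } yield (t, u)

def kUnmatch[T](ts: List[T], us: List[T])(kind: T => Int): List[T] =
  ts.filterNot(t => us.exists(u => kind(u) == kind(t))) ++
  us.filterNot(u => ts.exists(t => kind(t) == kind(u)))
```

Thanks to the invariant maintained by fusion (each kind occurs at most once per union), every addend of one union is matched with at most one addend of the other.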
Auxiliary definitions used by the fusion specification:

⊕(S): transforms a multiset of addends into a union type of these addends (a right inverse of ◦(T)):
    ⊕({ }) := the empty type
    ⊕({T}) := T
    ⊕({T1, T2, . . . , Tn}) := T1 + ⊕({T2, . . . , Tn})     when n ≥ 2

The record-rebuilding function transforms a set of fields into a record type (a right inverse of the field-extraction function):
    the empty set ∅ is mapped to ERecT;
    {(l:T)^1} ∪ S is mapped to RecT(l, T, R), where R is the record type rebuilt from S;
    {(l:T)^?} ∪ S is mapped to OptRecT(l, T, R), where R is the record type rebuilt from S.

collapse: fuses all the element types of an array type into a single content type:
    collapse(EArrT) := the empty type
    collapse(ArrT(T, AT)) := Fuse(T, collapse(AT))
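The sketch below illustrates two ingredients of the fusion just specified: record fusion following rules (R1)/(R2) with the minimum cardinality (? < 1), and the collapse of an array body into a single content type. The full fusion function is taken as a parameter (fuse), the constructors are re-declared minimally so that the sketch stands alone, and all names are our own; this is an illustration of the specification, not the authors' implementation.

```scala
// Minimal re-declaration of the constructors needed here (illustrative names).
sealed trait JType
case object EmptyT extends JType                       // the empty type
final case class Field(label: String, tpe: JType, optional: Boolean)
final case class RecT(fields: List[Field]) extends JType
final case class SimplArrT(body: JType) extends JType  // simplified array type [T*]

// Record fusion, rules (R1)/(R2): matching keys are fused recursively and keep the
// minimum cardinality (optional as soon as one side is optional); unmatched keys
// become optional in the result. Records are treated as sets of fields, so the
// order of fields is irrelevant.
def fuseRecords(r1: RecT, r2: RecT)(fuse: (JType, JType) => JType): RecT = {
  val m1 = r1.fields.map(f => f.label -> f).toMap
  val m2 = r2.fields.map(f => f.label -> f).toMap
  val fields = (m1.keySet ++ m2.keySet).toList.map { l =>
    (m1.get(l), m2.get(l)) match {
      case (Some(f1), Some(f2)) =>                       // (R1), with min(m, n)
        Field(l, fuse(f1.tpe, f2.tpe), f1.optional || f2.optional)
      case (Some(f1), None) => f1.copy(optional = true)  // (R2)
      case (None, Some(f2)) => f2.copy(optional = true)  // (R2)
      case (None, None)     => sys.error("unreachable: the label comes from one of the maps")
    }
  }
  RecT(fields)
}

// collapse: fold the element types of an array with the fusion operator, starting
// from the empty type, and wrap the result as a simplified array type [T*].
def simplifyArray(elems: List[JType])(fuse: (JType, JType) => JType): SimplArrT =
  SimplArrT(elems.foldRight(EmptyT: JType)((t, acc) => fuse(t, acc)))
```

On the example of Section 2, fusing {A:Str, B:Num} with {B:Bool, C:Str} in this way produces a record type with an optional A:Str field, a mandatory B field whose type is the fusion of Num and Bool, and an optional C:Str field.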
Theorem 5.2 (Correctness of Fuse) Given two normal types T1 and T2, if T3 = Fuse(T1, T2), then T1 <: T3 and T2 <: T3.

The proof of the above theorem relies on the following lemma.

Lemma 5.3 (Correctness of LFuse) Given two non-union normal types T1 and T2 with the same kind, we have that T3 = LFuse(T1, T2) implies both T1 <: T3 and T2 <: T3.

Another important property of fusion is commutativity.

Theorem 5.4 (Commutativity) The following two properties hold.
1. Given two normal types T1, T2, we have Fuse(T1, T2) = Fuse(T2, T1).
2. Given two non-union normal types T and U having the same kind, we have LFuse(T, U) = LFuse(U, T).

Associativity of binary type fusion is stated by the following theorem.

Theorem 5.5 (Associativity) The following two properties hold.
1. Given three normal types T1, T2, and T3, we have Fuse(Fuse(T1, T2), T3) = Fuse(T1, Fuse(T2, T3)).
2. Given three non-union normal types T, U, and V of the same kind, we have LFuse(LFuse(T, U), V) = LFuse(T, LFuse(U, V)).

6. EXPERIMENTAL EVALUATION
In this section we present an experimental evaluation of our approach, whose main goal is to validate our precision and succinctness claims. We also incorporate a preliminary study on using our approach in a cluster-based environment for the sake of dealing with complex large datasets.

6.1 Experimental Setup and Datasets
For our experiments, we used Apache Spark 1.6.1 [7] installed on two kinds of hardware. The first configuration consists of a single Mac mini machine equipped with an Intel dual-core 2.6 GHz processor, 16GB of RAM, and a SATA hard drive. This machine is mainly used for verifying the precision and succinctness claims. In order to assess the scalability of our approach and its ability to deal with large datasets, we also exploited a small cluster of six nodes connected using a 1Gb/s Gigabit link. Each node is equipped with two 10-core Intel 2.2 GHz CPUs, 64GB of RAM, and a standard RAID hard drive.

The choice of using Spark is intuitively motivated by its widespread use as a platform for processing large datasets of different kinds (e.g., relational, semi-structured, and graph data). Its main characteristic lies in its ability to keep large
datasets in main memory in order to process them in a fast and efficient manner. Spark offers APIs for major programming languages like Java, Scala, and Python. In particular, Scala serves our case well since it makes the encoding of pattern matching and inductive definitions very easy. Using Scala has, for instance, allowed us to implement both the type inference and the type fusion algorithms in a rather straightforward manner starting from their respective formal specifications.

The type inference implementation extends the Json4s library [4] for parsing the input JSON documents. This library yields a specific Scala object for each JSON construct (array, record, string, etc.), and this object is used by our implementation to generate the corresponding type construct. The type fusion implementation follows a standard functional programming approach and does not need to be commented.

It is important to mention that the Spark API offers a feature for extracting a schema from a JSON document. However, this schema inference suffers from two main drawbacks. First, the inferred schemas do not contain regular expressions, which prevents one from concisely representing repeated types, while our type system uses the Kleene star to encode the repetition of types. Second, the Spark schema extraction is imprecise when it comes to dealing with arrays containing mixed content, such as, for instance, an array of the form:

[Num, Str, {l : Str}]

In such a case, the Spark API uses type coercion, yielding an array of type String only. In our case, we can exploit union types to generate a much more precise type:

[(Num + Str + {l : Str})∗]

For our experiments we used four datasets. The first two datasets are borrowed from an existing work [13] and correspond to data crawled from GitHub and from Twitter. The third dataset consists in a snapshot of Wikidata [6], a large repository of facts feeding the Wikipedia portal. The last dataset consists in a crawl of NYTimes articles using the NYTimes API [5]. A detailed description of each dataset is provided in the sequel.

GitHub.
This dataset corresponds to metadata generated upon pull requests issued by users willing to commit a new version of code. It comprises 1 million JSON objects sharing the same top-level schema and only varying in their lower-level schema. All objects of this dataset consist exclusively of records, sometimes nested, with a nesting depth never greater than four. Arrays are not used at all.

Twitter.
Our second dataset corresponds to metadata that are attached to the tweets shared by Twitter users. It comprises nearly 10 million records corresponding, in majority, to tweet entities. A tiny fraction of these records corresponds to a specific API call meant to delete tweets using their ids. This dataset is interesting for our experiments for many reasons. First, it uses both records and arrays of records, although the maximum level of nesting is 3. Second, it contains five different top-level schemas sharing common parts. Finally, it mixes two kinds of JSON records (tweets and deletes). This dataset is useful to assess the effectiveness of our typing approach when dealing with arrays.

Wikidata.
The largest dataset comprises 21 million records reaching a size of 75GB and corresponding to Wikipedia facts. These facts are structured following a fixed schema, but suffer from a poor design compared to the previous datasets. For instance, an important portion of Wikidata objects corresponds to claims issued by users. These user identifiers are directly encoded as keys, whereas a clean design would suggest encoding this information as the value of a specific key called id, for example. This dataset can be of interest to our experiments since several records reach a nesting level of 6.

NYTimes.
The last dataset we are considering here is probably the most interesting one; it comprises approximately 1.2 million records and reaches a size of 22GB. Its objects feature both nested records and arrays, and are nested up to 7 levels. Most of the fields in records are associated with text data, which explains the large size of this dataset compared to the previous ones. These records encode metadata about news articles, such as the headline, the most prominent keywords, the lead paragraph, as well as a snippet of the article itself. The interest of this dataset lies in the fact that the content of fields is not fixed and varies from one record to another. A quick examination of an excerpt of this dataset has revealed that the content of the headline field is associated, in some records, with subfields labeled main, content kicker, kicker, while in other records it is associated with subfields labeled main and print headlines. Another common pattern in this dataset is the use of Num and Str types for the same field.

In order to compare the results of our experiments using the four datasets, we decided to limit the size of every dataset to the first million records (the size of the smallest one). We also created, starting from each dataset, sub-datasets by restricting the original ones to, respectively, one thousand (1K), ten thousand (10K), and one hundred thousand (100K) records chosen in a random fashion. Table 1 reports the size of each of these sub-datasets.

              1K       10K      100K     1M
GitHub        14MB     137MB    1.3GB    14GB
Twitter       2.2MB    22MB     216MB    2.1GB
Wikidata      23MB     155MB    1.1GB    5.4GB
NYTimes       10MB     189MB    2GB      22GB

Table 1: (Sub-)dataset sizes.

6.2 Testing Scenario and Results
The main goal of our experiments is to assess the effectiveness of our approach and, in particular, to understand if it is able to return succinct yet precise fused types. To do so we report in Tables 2 to 5, for each dataset, the number of distinct types, the min, max, and average size of these types, as well as the size of the fused type. The notion of size of a type is standard, and corresponds to the size (number of nodes) of its Abstract Syntax Tree. For fairness, one can consider the average size as a baseline wrt which we
compare the size of the fused type. This helps us judge the effectiveness of our fusion at collapsing common parts of the input types.

From Tables 2, 3, and 4, it is easy to observe that our primary goal of succinctness is achieved for the GitHub and the Twitter datasets. Indeed, the ratio between the size of the fused type and the average size of the input types is not bigger than 1.4 for GitHub, whereas it is bounded by 4 for Twitter, which are relatively good factors. These results are not surprising: GitHub objects are homogeneous. Twitter has a more varying structure and, in addition, it mixes two different kinds of objects, deletes and tweets, as outlined in the description of this dataset. This explains the slight difference in terms of compaction wrt GitHub.

As expected, the results for Wikidata are worse than the results for the previous datasets, due to the particularity of this dataset concerning the encoding of user-ids as keys. This has an impact on our fusion technique, which relies on keys to merge the underlying records. Still, our fusion algorithm manages to collapse the common parts of the input types, as testified by the fact that the size of the fused types is smaller than the sum of the input types. Finally, the results for the NYTimes dataset, which features many irregularities, are promising and even better than the rest. This can be explained by the fact that the fields in the first level are fixed while the lower-level fields may vary. This does not happen in the previous datasets, where the variations occur on the first level.

           Inferred types size                     Fused
        # types    min.   max.    avg.        type size
1K      29         147    305     233         321
10K     66         147    305     239         322
100K    261        147    305     246         330
1M      3,043      147    319     257         354

Table 2: Results for GitHub.

           Inferred types size                     Fused
        # types    min.   max.    avg.        type size
1K      167        7      218     74          221
10K     677        7      276     75          273
100K    2,320      7      308     75          277
1M      8,117      7      390     77          299

Table 3: Results for Twitter.

           Inferred types size                     Fused
        # types    min.   max.    avg.        type size
1K      555        299    887     597.25      88
10K     2,891      6      943     640         331
100K    15,959     6      997     755         481
1M      312,458    6      1,046   674         760

Table 5: Results for NYTimes.

Typing execution times are reported in Table 6. As can be observed, processing the Wikidata dataset is more time-consuming than processing the two other datasets. This is explained, once again, by the nature of the Wikidata dataset. Observe also that the processing time of GitHub is larger than that of Twitter due to the size of the former dataset, which is larger than the latter.

            1K     10K     100K     1M
GitHub      1s     4s      32s      297s
Twitter     0s     1s      7s       73s
Wikidata    7s     15s     121s     925s

Table 6: Typing execution times.

6.3 Scalability
To assess the scalability of our approach, we have deployed the typing and the fusion implementations on our cluster. To exploit the full capacity of the cluster in terms of number of cores, we set the number of cores to 120, that is, 20 cores per node. We also assign to our job 300GB of main memory, hence leaving 72GB for the task manager and other runtime monitoring processes. We used the NYTimes full dataset (22GB) stored on HDFS. Because our approach requires two steps (type inference and type fusion), we adopted a strategy where the results of the type inference step are persisted into main memory to be directly available to the fusion step. We ran the experiments on datasets of varying size obtained by restricting the full one to the first fifty, two hundred fifty, and five hundred thousand records, respectively. The results of these experiments are reported in Table 7 together with some statistics on these datasets (number of records and cardinality of the distinct types). It can be observed that execution time increases linearly with the dataset size.
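As a rough illustration of how the two phases can be wired together on Spark, the sketch below types every line of an input file, persists the intermediate types in main memory, and reduces them with the fusion operator. Here parseJson, inferType, and fuse stand for the Json4s-based parsing, the Map-phase typing, and the fusion function of the implementation; the wiring and all names are our own reconstruction under these assumptions, not the authors' code.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object SchemaInferenceJob {
  type JValue = AnyRef   // stand-in for the parsed JSON representation (Json4s in the paper)
  type JType  = AnyRef   // stand-in for the inferred types of Figure 3

  def parseJson(line: String): JValue = ???   // parsing of one JSON document
  def inferType(v: JValue): JType     = ???   // Map phase (Section 5.1)
  def fuse(t: JType, u: JType): JType = ???   // fusion (Section 5.2), commutative and associative

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("json-schema-inference"))

    // Map phase: one inferred type per input JSON value.
    val types = sc.textFile(args(0)).map(line => inferType(parseJson(line)))

    // Keep the inferred types in main memory so that the fusion step reads them directly.
    types.persist(StorageLevel.MEMORY_ONLY)

    // Reduce phase: commutativity and associativity of fuse allow a parallel reduction.
    val schema = types.reduce(fuse)

    println(schema)
    sc.stop()
  }
}
```

A partition-local variant in the spirit of the strategy discussed next could, for instance, first fuse the types within each partition (e.g., with mapPartitions) and only then fuse the handful of per-partition schemas.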
To overcome this problem, we considered a strategy based on partitioning the input data that would force Spark to take full advantage of the cluster. In order to avoid the overhead of data shuffling, the ideal solution would be to force computation to be local until the end of the processing. Because Spark 1.6 does not explicitly allow such an option, we had to opt for a manual strategy where each partition of data is processed in isolation, and each of the inferred schemas is finally fused with the others (this is a fast operation as each schema to fuse has a very small size). The purpose is to simulate the realistic situation where Spark processes data exclusively locally, thus avoiding the overhead of synchronization. The times for processing each partition are reported in Table 8. The average time is 2.85 minutes, which is a rather reasonable time for processing a dataset of 22 GB.

               # objects    # types    time
partition 1    284,943      67,632     2.4 min
partition 2    300,000      83,226     3.8 min
partition 3    300,000      89,929     1.9 min
partition 4    300,000      84,333     3.3 min

Table 8: Partition-based processing of NYTimes.

Note that this simple yet effective optimization is possible thanks to the associativity of our fusion process.

7. CONCLUSIONS AND FUTURE WORK
The approach described in this paper is a first step towards the definition of a schema-based mechanism for exploring massive JSON datasets. This issue is of great importance due to the overwhelming quantity of JSON data manipulated on the web and due to the flexibility offered by the systems managing these data.

The main idea of our approach is to infer schemas for the input datasets in order to get insights about the structure of the underlying data; these schemas are succinct yet precise, and faithfully capture the structure of the input data. To this end, we started by identifying a schema language with the operators needed to ensure succinctness and precision of our inferred schemas. We then proposed a fusion mechanism able to detect and collapse common parts of the input types. An experimental evaluation on several datasets validated our claims and showed that our type fusion approach actually achieves the goals of succinctness, precision, and efficiency.

Another benefit of our approach is its ability to perform type inference in an incremental fashion. This is possible because the core of our technique, fusion, is incremental by essence. One possible and interesting application would be to process a subset of a large dataset to get a first insight on the structure of the data before deciding whether to refine this partial schema by processing additional data.

In the near future we plan to enrich schemas with statistical and provenance information about the input data. Furthermore, we want to improve the precision of the inference process for arrays and study the relationship between precision and efficiency.

8. ACKNOWLEDGMENTS
Houssem Ben Lahmar has been partially supported by the project D03 of SFB/Transregio 161.³

³ http://www.trr161.de/interfak/forschergruppen/sfbtrr161

9. REFERENCES
[1] The JSON Query Language. http://www.jsoniq.org.
[2] JSON schema definition language. http://jsoniq.org/docs/JSound/html-single/.
[3] JSON Schema language. http://json-schema.org.
[4] Json4s library. http://json4s.org.
[5] NYTimes API. https://developer.nytimes.com/.
[6] Wikidata. https://dumps.wikimedia.org/wikidatawiki/entities/.
[7] Apache Spark. Technical report, 2016. http://spark.apache.org.
[8] V. Benzaken, G. Castagna, D. Colazzo, and K. Nguyên. Type-based XML projection. VLDB '06, pages 271–282, 2006.
[9] K. S. Beyer, V. Ercegovac, R. Gemulla, A. Balmin, M. Y. Eltabakh, C. Kanne, F. Özcan, and E. J. Shekita. Jaql: A scripting language for large scale semistructured data analysis. PVLDB, 4(12):1272–1283, 2011.
[10] T. Bray. The JavaScript Object Notation (JSON) data interchange format, 2014.
[11] S. Cebiric, F. Goasdoué, and I. Manolescu. Query-oriented summarization of RDF graphs. PVLDB, 8(12):2012–2015, 2015.
[12] D. Colazzo, G. Ghelli, and C. Sartiani. Typing massive JSON datasets. In XLDI '12, Affiliated with ICFP, 2012.
[13] M. DiScala and D. J. Abadi. Automatic generation of normalized relational schemas from nested key-value data. SIGMOD '16, pages 295–310, 2016.
[14] D. D. Freydenberger and T. Kötzing. Fast learning of restricted regular expressions and DTDs. Theory Comput. Syst., 57(4):1114–1158, 2015.
[15] Z. H. Liu, B. Hammerschmidt, and D. McMahon. JSON data management: Supporting schema-less development in RDBMS. SIGMOD '14, pages 1247–1258, 2014.
[16] J. McHugh and J. Widom. Query optimization for XML. VLDB '99, pages 315–326. Morgan Kaufmann Publishers Inc., 1999.
[17] M. Murata, D. Lee, M. Mani, and K. Kawaguchi. Taxonomy of XML schema languages using formal language theory. ACM Trans. Internet Technol., 5(4):660–704, Nov. 2005.
[18] S. Nestorov, S. Abiteboul, and R. Motwani. Inferring structure in semistructured data. SIGMOD Record, 26(4):39–43, 1997.
[19] S. Nestorov, S. Abiteboul, and R. Motwani. Extracting schema from semistructured data. In L. M. Haas and A. Tiwary, editors, SIGMOD 1998, Proceedings ACM SIGMOD International Conference on Management of Data, June 2-4, 1998, Seattle, Washington, USA, pages 295–306. ACM Press, 1998.
[20] F. Pezoa, J. L. Reutter, F. Suarez, M. Ugarte, and D. Vrgoč. Foundations of JSON Schema. WWW '16, pages 263–273, 2016.
[21] S. Scherzinger, E. C. de Almeida, T. Cerqueus, L. B. de Almeida, and P. Holanda. Finding and fixing type
mismatches in the evolution of object-NoSQL mappings. In T. Palpanas and K. Stefanidis, editors, Proceedings of the Workshops of the EDBT/ICDT 2016, volume 1558 of CEUR Workshop Proceedings. CEUR-WS.org, 2016.
[22] L. Wang, S. Zhang, J. Shi, L. Jiao, O. Hassanzadeh, J. Zou, and C. Wang. Schema management for document stores. Proc. VLDB Endow., 8(9):922–933, May 2015.