Schema Inference For Massive JSON Datasets

ABSTRACT
In recent years, JSON has emerged as a very popular data format for representing massive data collections. JSON data collections are usually schemaless. While this ensures several advantages, the absence of schema information has important negative consequences: the correctness of complex queries and programs cannot be statically checked, users cannot rely on schema information to quickly figure out the structural properties that could speed up the formulation of correct queries, and many schema-based optimizations are not possible.

In this paper we deal with the problem of inferring a schema from massive JSON datasets. We first identify a JSON type language which is simple and, at the same time, expressive enough to capture irregularities and to give complete structural information about input data. We then present our main contribution, which is the design of a schema inference algorithm, its theoretical study, and its implementation based on Spark, enabling reasonable schema inference time for massive collections. Finally, we report on an experimental analysis showing the effectiveness of our approach in terms of execution time, precision and conciseness of inferred schemas, and scalability.

CCS Concepts
• Information systems → Semi-structured data; Data model extensions; • Theory of computation → Type theory; Logic;

Keywords
JSON, schema inference, map-reduce, Spark, big data collections

© 2017, Copyright is with the authors. Published in Proc. 20th International Conference on Extending Database Technology (EDBT), March 21-24, 2017, Venice, Italy: ISBN 978-3-89318-073-8, on OpenProceedings.org. Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.

1. INTRODUCTION
Big Data applications typically process and analyze very large structured and semi-structured datasets. In many of these applications, and in those relying on NoSQL document stores in particular, data are represented in JSON (JavaScript Object Notation) [10], a data format that is widely used thanks to its flexibility and simplicity.

JSON data collections are usually schemaless. This ensures several advantages: in particular, it enables applications to quickly consume huge amounts of semi-structured data without waiting for a schema to be specified. Unfortunately, the lack of a schema makes it impossible to statically detect unexpected or unwanted behaviours of complex queries and programs (i.e., lack of correctness), users cannot rely on schema information to quickly figure out structural properties that could speed up the formulation of correct queries, and many schema-based optimizations are not possible.

In this paper we deal with the problem of inferring a schema from massive JSON datasets. Our main goal in this work is to infer structural properties of JSON data, that is, a description of the structure of JSON objects and arrays that takes into account nested values and the presence of optional values. These are the main properties that characterize semi-structured data, and having a tool that ensures fast, precise, and concise inference is crucial in modern applications characterized by agile consumption of huge amounts of data coming from multiple and disparate sources.

The approach we propose here is based on a JSON schema language able to capture structural irregularities and complete structural information about input data. This language resembles and borrows mechanisms from existing proposals [20], but it has the advantage of being simple yet very expressive.

The proposed technique infers a schema that provides a global description of the whole input JSON dataset, while having a size that is small enough to enable a user to consult it in a reasonable amount of time, in order to get a global knowledge of the structural and type properties of the JSON collection. The description of the input JSON collection is global in the sense that each path that can be traversed in the tree-structure of each input JSON value can be traversed in the inferred schema as well. This property is crucial to enable a series of query optimization tasks. For instance,
{ "A": 123,
  "B": "The ...",
  "C": false,
  "D": ["abc", "cde", "fr12"] }

Figure 1: A JSON record r1.

Type inference.
Type inference, during the Map phase, is dedicated to inferring individual types for the input JSON values, and yields a set of distinct types to be fused during the Reduce phase. Some proposals of JSON schemas exist in the literature. With one exception [20], none of them uses regular expressions which, as we shall illustrate, are important for concisely representing types for array values. Moreover, a clean formal semantics specification of types is often missing in these works, hence making it difficult to understand their precise meaning.

The type language we adopt is meant to capture the core features of the JSON data model with an emphasis on succinctness. Intuitively, basic values are captured using standard data types (i.e., String, Number, Boolean), complex values are captured by introducing record and array type constructors, and a union type constructor is used to add flexibility and expressive power. To illustrate the type language, observe the following type that is inferred for the record r1 given in Figure 1:

{A : Num, B : Str, C : Bool, D : [Str, Str, Str]}

As we will show, the initial type inference is a quite simple and fast operation: it consists of a simple traversal of the input values that produces a type that is isomorphic to the value itself.

Type fusion.
Type fusion is the second step of our approach and consists in iteratively merging the types produced during the Map phase. Because it is performed during the Reduce phase in a distributed fashion, type fusion relies on a fusion operator which enjoys the commutativity and associativity properties. This fusion operator is invoked over two types T1 and T2, and produces a supertype of the inputs. To do so, the fusion collapses the parts of T1 and T2 that are identical and preserves the parts that are distinct in both types. To this end, T1 and T2 are processed in a synchronised top-down manner in order to identify common parts. The main idea is to represent only once what is common, and, at the same time, to preserve all the parts that differ.

Fusion treats atomic types, record types, and array types differently, as follows.

• Atomic types: the fusion of atomic types is obvious, as identical types are collapsed while different types are combined using the union operator.

• Record types: recall that valid record types enjoy key uniqueness. Therefore, the fusion of T1 and T2 is led by two rules:

(R1) matching keys from both types are collapsed and their respective types are recursively fused;

(R2) keys without a match are deemed optional in the resulting type and decorated with a question mark ?.

To illustrate those cases, assume that T1 and T2 are, respectively, {A:Str, B:Num} and {B:Bool, C:Str}. The only matching key is "B" and hence its two atomic types Num and Bool are fused, which yields Num + Bool. The other keys will be optional according to rule R2. Hence, fusion yields the type

T12 = {(A:Str)?, B:Num + Bool, (C:Str)?}

Assume now that T12 is fused with

T3 = {A:Null, B:Num}

Rules R1 and R2 need to be slightly adapted to deal with optional types. Intuitively, we should simply consider that optionality '?' prevails over the implicit total cardinality '1'. The resulting type is thus

T123 = {(A:Str + Null)?, B:Num + Bool, (C:Str)?}.

Fusion of nested records eventually associates keys with types that may be unions of atomic types, record types, and array types. We will see that, when such types are merged, we separately merge the atomic types, the record types, and the array types, and return the union of the result. For instance, the fusion of types

{l:(Bool + Str + {A:Num})}

and

{l:(Bool + {A:Str, B:Num})}

yields

{l:(Bool + Str + {A:(Num + Str), (B:Num)?})}.

• Array types: array fusion deserves special attention. A particular aspect to consider is that an array type obtained in the first phase may contain several repeated types, and may feature mixed-content. To deal with this, before fusing types we perform a kind of simplification on bodies by using regular expression types, and, in particular, union + and repetition ∗. To illustrate this point, consider the array value

["abc", "cde", {"E": "fr", "F": 12}],

containing two strings followed by a record (mixed-content). The first phase infers for this value the type

[Str, Str, {E:Str, F:Num}].

This type can actually be simplified. For instance, one can think of a partition-based approach which collapses adjacent identical types into a star-guarded type, thus transforming

[Str, Str, {E:Str, F:Num}]

into

[(Str)∗, {E:Str, F:Num}]

by collapsing the string types. The resulting schema is indeed succinct and precise. However, succinctness cannot be guaranteed after fusion. For instance, if that type was to be merged with

[{E:Str, F:Num}, Str, Str],
where strings and record swapped positions, succinctness would be lost because we need to duplicate at least one sub-expression, (Str)∗ or {E:Str, F:Num}. As we are mainly concerned with generating types that can be human-readable, we trade some precision for succinctness and do not account for position anymore. To achieve this, in our simplification process (made before fusing array types) we generalize the above partition-based solution by returning the star-guarded union of all distinct types expressed in an array. So, simplification of either

[Str, Str, {E:Str, F:Num}]

or

[{E:Str, F:Num}, Str, Str]

yields the same type

S = [(Str + {E:Str, F:Num})∗].

After the array types have been simplified in this manner, they are fused by simply recursively fusing their content types, applying the same technique described for record types: when the body type is a union type, we separately merge the atomic components, the array components, and the record components, and take the union of the results.

3. RELATED WORK
The problem of inferring structural information from JSON data collections has recently gained the attention of the database research community. The closest work to ours is the very preliminary investigation that we presented in [12]. While [12] only provides a sketch of a MapReduce approach for schema inference, in this paper we present the results of a much deeper study. In particular, while in [12] a declarative specification of only a few cases of the fusion process is presented, in this paper we fully detail this process and provide a formal specification as well as a fusion algorithm. Furthermore, differently from [12], we present here an experimental evaluation of our approach validating our claims of parallelizability and succinctness.

In [22] Wang et al. present a framework for efficiently managing a schema repository for JSON document stores. The proposed approach relies on a notion of JSON schema called skeleton. In a nutshell, a skeleton is a collection of trees describing structures that frequently appear in the objects of a JSON data collection. In particular, the skeleton may totally miss information about paths that can be traversed in some of the JSON objects. In contrast, our approach enables the creation of a complete yet succinct schema description of the input JSON dataset. As already said, having such a complete structural description is of vital importance for many tasks, like query optimisation, defining and enforcing access-control security policies, and, importantly, giving the user a global structural vision of the database that can help her in querying and exploring the data in an effective way. Another important application of complete schema information is query type checking: as illustrated in [12], our inferred schemas can be used to make type checking of Pig Latin scripts much stronger.

In a very recent work [20], motivated by the need of laying the formal foundations for the JSON Schema language [3], Pezoa et al. present the formal semantics of that language, as well as a theoretical study of its expressive power and validation problem. While that work does not deal with the schema inference problem, our schema language can be seen as a core part of the JSON Schema language studied therein, and shares union types and repetition types with that one. These constructors are at the basis of our technique to collapse several schemas into a more succinct one.

An alternative proposal for typing JSON data is JSound [2]. That language is quite restrictive wrt ours and JSON Schema: for instance, it lacks union types.

In a very recent work [13], DiScala and Abadi deal with the problem of automatically transforming denormalised, nested JSON data into normalised relational data that can be stored in an RDBMS; this is achieved by means of a schema generation algorithm that learns the normalised, relational schema from the data. Differently from that work, we deal with schemas that are far from being relational, and are closer to tree regular grammars [17]. Furthermore, the approach proposed in [13] ignores the original structure of the JSON input dataset and, instead, depends on patterns in the attribute data values (functional dependencies) to guide its schema generation. So, that approach is complementary to ours.

In [15] Liu et al. propose storage, querying, and indexing principles enabling RDBMSs to manage JSON. The paper does not deal with schema inference, but indicates a possible optimisation of their framework based on the identification of common attributes in JSON objects that can be captured by a relational schema for optimization purposes. In [21] Scherzinger et al. propose a plugin to track changes in object-NoSQL mappings. The technique is currently limited to only detecting mismatches between base types (e.g., Boolean, Integer, String), and the authors claim that a wider knowledge of schema information is needed to enable the detection of other kinds of changes, like, for instance, the removal or renaming of attributes.

It is important to state that the problem of schema inference has already been addressed in the past in the context of semi-structured and XML data models. In [18] and [19], Nestorov et al. describe an approach to extract a schema from semistructured data. They propose an object-oriented type system where nodes are captured by classes built starting from nodes sharing the same incoming and outgoing edges, and where data edges are generalized to relations between the classes. In [19], the problem of building a type out of a collection of semistructured documents is studied. The emphasis is put on minimizing the size of the resulting type while maximizing its precision. Although that work considers a very general data model captured by graphs, it does not suit our context. Firstly, we consider the JSON model, which is tree-shaped by nature and features specific constructs such as arrays that are not captured by the semi-structured data model. Secondly, we aim at processing potentially large datasets efficiently, a problem that is not directly addressed in [18] and [19].

More recent efforts on XML schema inference (see [14] and works cited therein) are also worth mentioning since they are somewhat related to our approach. The aim of these approaches is to infer restricted, yet expressive enough, forms of regular expressions starting from a positive set of strings representing element contexts of XML documents. While XML and JSON both allow one to represent tree-shaped data, they have radical differences that make existing XML related approaches difficult to apply to the JSON setting.
Similar remarks hold for related approaches to schema inference for RDF [11]. Furthermore, none of these approaches is designed to deal with massive datasets.

4. DATA MODEL AND TYPE LANGUAGE
This section is devoted to formalizing the JSON data model and the schema language we adopt.

We represent JSON values as records and arrays, whose abstract syntax is given in Figure 2. Basic values B comprise the null value, booleans, numbers n, and strings s. As outlined in Section 2, records are sets of fields, each field being an association of a value V to a key l, whereas arrays are sequences of values. The abstract syntax is practical for the formal treatment, but we will typically use the more readable notation introduced at the bottom of Figure 2, where records are represented as {l1 : V1, . . . , ln : Vn} and arrays are represented as [V1, . . . , Vn].

Syntax:
    V ::= B | R | A                       Top-level values
    B ::= null | true | false | n | s     Basic values
    R ::= ERec | Rec(l, V, R)             Records
    A ::= EArr | Arr(V, A)                Arrays

Semantics of records (domain: FS(Keys × Values)):
    ⟦ERec⟧ := ∅
    ⟦Rec(l, V, R)⟧ := {(l, V)} ∪ ⟦R⟧

Semantics of arrays (domain: Lists(Values)):
    ⟦EArr⟧ := [ ]
    ⟦Arr(V, A)⟧ := ⟦V⟧ :: ⟦A⟧

Notation:
    {l1 : V1, . . . , ln : Vn} := Rec(l1, V1, . . . Rec(ln, Vn, ERec))
    [V1, . . . , Vn] := Arr(V1, . . . Arr(Vn, EArr))

Figure 2: Syntax of JSON data.

In JSON, a record is well-formed only if all its top-level keys are mutually different. In the sequel, we only consider well-formed JSON records, and we use Keys(R) to denote the set of the top-level keys of R.

Since a record is a set of fields, we identify two records that only differ in the order of their fields.

The syntax of the JSON schema language we adopt is depicted in Figure 3. The core of this language is captured by the non-terminals BT, RT, and AT, which are a straightforward generalization of their B, R, and A counterparts from the data model syntax.

As previously illustrated in Section 2, we adopt a very specific form of regular types in order to prepare an array type for fusion. Before fusion, an array type [T1, . . . , Tn] is simplified as [(T1 + . . . + Tn)∗], or, more precisely, as [LFuse(T1, . . . , Tn)∗]: instead of giving the content type element by element as in [T1, . . . , Tn], we just say that it contains a sequence of values all belonging to LFuse(T1, . . . , Tn), which will be defined as a compact super-type of T1 + . . . + Tn. This simplification is allowed by the fact that, besides the basic array types AT = [T1, . . . , Tn], we also have the simplified array type SAT = [T∗], where T may be any type, including a union type.

A field OptRecT(l, T, . . .), represented as l : T? in the simplified notation, represents an optional field, that is, a field that may be either present or absent in a record of the corresponding type. For example, a type {l : Num?, m : (Str + Null)} describes records where l is optional and, if present, contains a number, while the m field is mandatory and may contain either null or a string.

A union type T + U contains the union of the values from T and those from U. The empty type denotes the empty set.¹

We now define schema semantics by means of the function ⟦·⟧, defined as the minimal function mapping types to sets of values that satisfies the following equations. For the sake of simplicity we omit the case of basic types.

Auxiliary functions:
    S^0 := {[ ]}
    S^{n+1} := {[V] :: a | V ∈ S, a ∈ S^n}
    S^∗ := ⋃_{i∈ℕ} S^i

Record types (domain: Sets(FS(Keys × Values))):
    ⟦ERecT⟧ := {∅}
    ⟦RecT(l, T, RT)⟧ := {{(l, V)} ∪ R | V ∈ ⟦T⟧, R ∈ ⟦RT⟧}
    ⟦OptRecT(l, T, RT)⟧ := ⟦RecT(l, T, RT)⟧ ∪ ⟦RT⟧

Array types and simplified array types (domain: Sets(Lists(Values))):
    ⟦EArrT⟧ := {[ ]}
    ⟦ArrT(T, AT)⟧ := {[V] :: A | V ∈ ⟦T⟧, A ∈ ⟦AT⟧}
    ⟦[T∗]⟧ := ⟦T⟧∗

Union types:
    ⟦T + U⟧ := ⟦T⟧ ∪ ⟦U⟧
    and the semantics of the empty type is the empty set ∅.

The basic idea behind our type fusion mechanism is that we always generalize the union of two record types to one record type containing the keys of both, and similarly for the union of two array types. We express this idea as 'merging types that have the same kind'. The following kind() function, which maps each type to an integer ranging over {0, . . . , 5}, is used to implement this approach.

    kind(Null) = 0    kind(Bool) = 1    kind(Num) = 2
    kind(Str) = 3     kind(RT) = 4      kind(AT) = kind(SAT) = 5

In the sequel, generic types are indicated by the metavariables T, U, W, while BT, RT, and AT are reserved for basic types, record types, and array types.

¹ The empty type is never used during type inference, since no value belongs to it. In greater detail, it is actually a technical device that is only useful when an empty array type EArrT is simplified, before fusion, into a simplified array type: EArrT (that is, the type [ ]) is simplified into a simplified array type whose body is the empty type, which has the same semantics as EArrT, and our algorithms never insert the empty type in any other position.
    T   ::= BT | RT | AT | SAT | (empty type) | T + T            Top-level types
    BT  ::= Null | Bool | Num | Str                              Basic types
    RT  ::= ERecT | RecT(l, T, RT) | OptRecT(l, T, RT)           Record types
    AT  ::= EArrT | ArrT(T, AT)                                  Array types
    SAT ::= [T∗]                                                 Simplified array types

Notation:
    {l1 : T1 [?], . . . , ln : Tn [?]} := [Opt]RecT(l1, T1, . . . [Opt]RecT(ln, Tn, ERecT))    ('?' is translated as 'Opt')
    [ ] := EArrT
    [T1, . . . , Tn] := ArrT(T1, . . . ArrT(Tn, EArrT))

Figure 3: Syntax of the JSON schema language.
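To make the grammar of Figure 3 concrete, the sketch below shows one possible Scala encoding of the type language as an algebraic data type, together with the kind() function defined above. The names (JType, Field, RecT, and so on) and the exact constructor shapes are our own illustration under these assumptions, not the paper's implementation.

```scala
// A possible Scala encoding of the type language of Figure 3 (illustrative names).
sealed trait JType
case object NullT  extends JType                                    // Null
case object BoolT  extends JType                                    // Bool
case object NumT   extends JType                                    // Num
case object StrT   extends JType                                    // Str
case object EmptyT extends JType                                    // the empty type
final case class UnionT(left: JType, right: JType) extends JType    // T + T

// A record type is a set of fields; an optional field carries the '?' cardinality.
final case class Field(label: String, tpe: JType, optional: Boolean)
final case class RecT(fields: List[Field]) extends JType

// Array types, element by element, and simplified array types [T*].
final case class ArrT(elems: List[JType]) extends JType
final case class SimplArrT(body: JType)   extends JType

// kind(): the six-valued classification used by fusion (Section 4).
def kind(t: JType): Int = t match {
  case NullT        => 0
  case BoolT        => 1
  case NumT         => 2
  case StrT         => 3
  case _: RecT      => 4
  case _: ArrT      => 5
  case _: SimplArrT => 5
  case EmptyT | _: UnionT =>
    sys.error("kind() is only applied to non-union, non-empty types")
}
```

With this encoding, the type {l : Num?, m : (Str + Null)} discussed above would be written RecT(List(Field("l", NumT, optional = true), Field("m", UnionT(StrT, NullT), optional = false))).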
Later on, in order to express the correctness of the fusion process, we rely on the usual notion of subtyping (type inclusion).

Definition 4.1 (Subtyping) Let T and U be two types. Then T is a subtype of U, denoted by T <: U, if and only if ⟦T⟧ ⊆ ⟦U⟧.

The subtyping relation is a partial order among types. We do not use any subtype checking algorithm in this work, but we exploit this notion to state properties of our schema inference approach.

5. SCHEMA INFERENCE
As already said, our approach is based on two steps: i) type inference for each single value in the input JSON data collection, and ii) fusion of the types generated by the first step. We present these steps in the following two sections.

5.1 Initial Schema Inference
The first step of our approach consists of a Map phase that performs schema inference for each single value of the input collection. Type inference for single values is done according to the inference rules in Figure 4. Each rule allows one to infer the type of a value indicated in the conclusion (the part below the line) in terms of types recursively determined in the premises (the part above the line). Rules with no premises deal with the terminal cases of the recursive typing process, which infers the type of a value by simply reflecting the structure of the value itself. Note the particular case of record values, where uniqueness of the attribute keys li is checked. Also notice that these rules are deterministic: each possible value matches at most the conclusion of one rule. These rules, hence, directly define a recursive typing algorithm. The following lemma states the soundness of value typing, and it can be proved by a simple induction.

Lemma 5.1 For any JSON value V, ⊢ V ; T implies V ∈ ⟦T⟧.

It is worth noticing that the schema inference done in this phase does not exploit the full expressivity of the schema language. Union types, optional fields, and repetition types (the Simplified Array Types) are never inferred; these types will only be produced by the schema fusion phase described next.

5.2 Schema Fusion
The second phase of our approach is meant to fuse all the types inferred in the first Map phase. The main mechanism of this phase is a binary fusion function that is commutative and associative. These properties are crucial as they ensure that the function can be iteratively applied over n types in a distributed and parallel fashion.

When fusion is applied over two types T and U, it outputs either a single type obtained by recursively merging T and U, if they have the same kind, or the simple union T + U otherwise. Since fusion may result in a union type, and since this is in turn fused with other types, possibly obtained by fusion itself, the fusion function has to deal with the case where union types T = T1 + . . . + Tn and U = U1 + . . . + Um need to be fused. In this case, our fusion function identifies and fuses types Tj and Uh with matching kinds, while types of non-matching kinds are just moved unchanged into the output union type. As we will see later, the fusion process ensures the invariant property that a given kind may occur at most once in each output union type; hence, in the two union types above, n ≤ 6 and m ≤ 6, since we only have six different kinds.

The auxiliary functions KMatch and KUnmatch, defined in Figure 5, respectively have the purpose of collecting pairs of types of the same kind in two union types T1 and T2, and of collecting the non-matching types. In Figure 5, two similar functions FMatch and FUnmatch are defined: they identify and collect fields having matching/unmatched keys in two input record-type bodies RT1 and RT2.

These functions are based on the auxiliary function ◦(T) and on a companion field-extraction function for record types. The function ◦(T) transforms a union type T1 + . . . + Tn into the multiset of its addends, i.e., the non-union types T1, . . . , Tn. The field-extraction function transforms a record type {(l1:T1)^m1, . . . , (ln:Tn)^mn} into the set of its fields; in this case we can use a set since no repetition of keys is possible. Here we use (l:T)^1 to denote a mandatory field, (l:T)^? to denote an optional field, and the symbols m and n as metavariables that range over {1, ?}.

We are now ready to present the fusion function. Its formal specification is given in Figure 6. We use the function ⊕(S), which is a right inverse of ◦(T) and rebuilds a union type from a multiset of non-union types, and a record-rebuilding function, which is a right inverse of the field-extraction function and rebuilds a record type from a set of fields. We also use min(m, n), a partial function that picks the "smallest" cardinality, by assuming ? < 1.

The general case, where two types T1 and T2 that may be union types have to be fused, is dealt with by the Fuse(T1, T2) function.
(TypeNull)  (TypeTrueBool)  (TypeNumber)  (TypeString)  (TypeEmptyRec)  (TypeEmptyArray): premise-free rules covering the terminal cases.

(TypeRec)
    ⊢ V ; T      ⊢ W ; RT      l ∉ Keys(RT)
    ----------------------------------------
        ⊢ Rec(l, V, W) ; RecT(l, T, RT)

(TypeArray)
    ⊢ V ; T      ⊢ W ; AT
    ----------------------------
    ⊢ Arr(V, W) ; ArrT(T, AT)

Figure 4: Type inference rules for JSON values.
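The following sketch illustrates the Map-phase typing of Section 5.1 as a single structural traversal, in the spirit of the rules of Figure 4. To keep it self-contained it uses a small, hand-rolled value ADT instead of the Json4s representation used in the paper's implementation; all names are illustrative.

```scala
// Minimal stand-ins for JSON values and for the inferred types (illustrative names).
sealed trait JValue
case object JNull                                     extends JValue
final case class JBool(b: Boolean)                    extends JValue
final case class JNum(n: BigDecimal)                  extends JValue
final case class JStr(s: String)                      extends JValue
final case class JArr(elems: List[JValue])            extends JValue
final case class JRec(fields: List[(String, JValue)]) extends JValue

sealed trait JType
case object NullT extends JType
case object BoolT extends JType
case object NumT  extends JType
case object StrT  extends JType
final case class Field(label: String, tpe: JType, optional: Boolean)
final case class RecT(fields: List[Field]) extends JType
final case class ArrT(elems: List[JType])  extends JType

// One traversal per value: the inferred type mirrors the structure of the value,
// and duplicate record keys are rejected, as required by rule (TypeRec).
def inferType(v: JValue): JType = v match {
  case JNull       => NullT
  case JBool(_)    => BoolT
  case JNum(_)     => NumT
  case JStr(_)     => StrT
  case JArr(elems) => ArrT(elems.map(inferType))
  case JRec(fields) =>
    val keys = fields.map(_._1)
    require(keys.distinct.size == keys.size, s"duplicate keys in record: $keys")
    RecT(fields.map { case (l, value) => Field(l, inferType(value), optional = false) })
}
```

Applied to the record r1 of Figure 1, such a traversal yields the element-by-element type {A : Num, B : Str, C : Bool, D : [Str, Str, Str]}; union types, optional fields, and simplified array types only appear later, during fusion.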
(From Figure 5.)

◦(T): transforms a type into a multiset of non-union types (∪ below denotes multiset union):
    ◦(T1 + T2) := ◦(T1) ∪ ◦(T2)
    ◦(T) := { }     when T is the empty type
    ◦(T) := {T}     when T is neither a union type nor the empty type

KMatch(T1, T2) := {(U1, U2) | U1 ∈ ◦(T1), U2 ∈ ◦(T2), kind(U1) = kind(U2)}

KUnmatch(T1, T2) := {U1 ∈ ◦(T1) | ∀U2 ∈ ◦(T2). kind(U1) ≠ kind(U2)}
                  ∪ {U2 ∈ ◦(T2) | ∀U1 ∈ ◦(T1). kind(U2) ≠ kind(U1)}
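A possible reading of KMatch and KUnmatch in code is given below. The sketch is generic: it works over the multiset of addends produced by ◦(·), represented as a plain list, for any type representation equipped with a kind function; it is an illustration of the definitions above, not the paper's code.

```scala
// Pair up addends of two (flattened) union types that have the same kind,
// and collect the addends whose kind has no counterpart on the other side.
def kMatch[T](ts: List[T], us: List[T])(kind: T => Int): List[(T, T)] =
  for { t <- ts; u <- us; if kind(t) == kind(u) } yield (t, u)

def kUnmatch[T](ts: List[T], us: List[T])(kind: T => Int): List[T] =
  ts.filterNot(t => us.exists(u => kind(u) == kind(t))) ++
  us.filterNot(u => ts.exists(t => kind(t) == kind(u)))
```

Thanks to the invariant maintained by fusion (each kind occurs at most once per union), every addend of one union is matched with at most one addend of the other.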
Auxiliary definitions used by the fusion specification:

⊕(S): transforms a multiset of addends into a union type of these addends (a right inverse of ◦(T)):
    ⊕({ }) := the empty type
    ⊕({T}) := T
    ⊕({T1, T2, . . . , Tn}) := T1 + ⊕({T2, . . . , Tn})     when n ≥ 2

The record-rebuilding function transforms a set of fields into a record type (a right inverse of the field-extraction function):
    the empty set ∅ is mapped to ERecT;
    {(l:T)^1} ∪ S is mapped to RecT(l, T, R), where R is the record type rebuilt from S;
    {(l:T)^?} ∪ S is mapped to OptRecT(l, T, R), where R is the record type rebuilt from S.

collapse: fuses all the element types of an array type into a single content type:
    collapse(EArrT) := the empty type
    collapse(ArrT(T, AT)) := Fuse(T, collapse(AT))
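The sketch below illustrates two ingredients of the fusion just specified: record fusion following rules (R1)/(R2) with the minimum cardinality (? < 1), and the collapse of an array body into a single content type. The full fusion function is taken as a parameter (fuse), the constructors are re-declared minimally so that the sketch stands alone, and all names are our own; this is an illustration of the specification, not the authors' implementation.

```scala
// Minimal re-declaration of the constructors needed here (illustrative names).
sealed trait JType
case object EmptyT extends JType                       // the empty type
final case class Field(label: String, tpe: JType, optional: Boolean)
final case class RecT(fields: List[Field]) extends JType
final case class SimplArrT(body: JType) extends JType  // simplified array type [T*]

// Record fusion, rules (R1)/(R2): matching keys are fused recursively and keep the
// minimum cardinality (optional as soon as one side is optional); unmatched keys
// become optional in the result. Records are treated as sets of fields, so the
// order of fields is irrelevant.
def fuseRecords(r1: RecT, r2: RecT)(fuse: (JType, JType) => JType): RecT = {
  val m1 = r1.fields.map(f => f.label -> f).toMap
  val m2 = r2.fields.map(f => f.label -> f).toMap
  val fields = (m1.keySet ++ m2.keySet).toList.map { l =>
    (m1.get(l), m2.get(l)) match {
      case (Some(f1), Some(f2)) =>                       // (R1), with min(m, n)
        Field(l, fuse(f1.tpe, f2.tpe), f1.optional || f2.optional)
      case (Some(f1), None) => f1.copy(optional = true)  // (R2)
      case (None, Some(f2)) => f2.copy(optional = true)  // (R2)
      case (None, None)     => sys.error("unreachable: the label comes from one of the maps")
    }
  }
  RecT(fields)
}

// collapse: fold the element types of an array with the fusion operator, starting
// from the empty type, and wrap the result as a simplified array type [T*].
def simplifyArray(elems: List[JType])(fuse: (JType, JType) => JType): SimplArrT =
  SimplArrT(elems.foldRight(EmptyT: JType)((t, acc) => fuse(t, acc)))
```

On the example of Section 2, fusing {A:Str, B:Num} with {B:Bool, C:Str} in this way produces a record type with an optional A:Str field, a mandatory B field whose type is the fusion of Num and Bool, and an optional C:Str field.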
Theorem 5.2 (Correctness of Fuse) Given two normal types T1 and T2, if T3 = Fuse(T1, T2), then T1 <: T3 and T2 <: T3.

The proof of the above theorem relies on the following lemma.

Lemma 5.3 (Correctness of LFuse) Given two non-union normal types T1 and T2 with the same kind, we have that T3 = LFuse(T1, T2) implies both T1 <: T3 and T2 <: T3.

Another important property of fusion is commutativity.

Theorem 5.4 (Commutativity) The following two properties hold.
1. Given two normal types T1, T2, we have Fuse(T1, T2) = Fuse(T2, T1).
2. Given two non-union normal types T and U having the same kind, we have LFuse(T, U) = LFuse(U, T).

Associativity of binary type fusion is stated by the following theorem.

Theorem 5.5 (Associativity) The following two properties hold.
1. Given three normal types T1, T2, and T3, we have Fuse(Fuse(T1, T2), T3) = Fuse(T1, Fuse(T2, T3)).
2. Given three non-union normal types T, U, and V of the same kind, we have LFuse(LFuse(T, U), V) = LFuse(T, LFuse(U, V)).

6. EXPERIMENTAL EVALUATION
In this section we present an experimental evaluation of our approach, whose main goal is to validate our precision and succinctness claims. We also incorporate a preliminary study on using our approach in a cluster-based environment for the sake of dealing with complex large datasets.

6.1 Experimental Setup and Datasets
For our experiments, we used Apache Spark 1.6.1 [7] installed on two kinds of hardware. The first configuration consists of a single Mac mini machine equipped with an Intel dual-core 2.6 GHz processor, 16GB of RAM, and a SATA hard drive. This machine is mainly used for verifying the precision and succinctness claims. In order to assess the scalability of our approach and its ability to deal with large datasets, we also exploited a small cluster of six nodes connected using a 1Gb/s Gigabit link. Each node is equipped with two 10-core Intel 2.2 GHz CPUs, 64GB of RAM, and a standard RAID hard drive.

The choice of using Spark is intuitively motivated by its widespread use as a platform for processing large datasets of different kinds (e.g., relational, semi-structured, and graph data). Its main characteristic lies in its ability to keep large
datasets in main memory in order to process them in a fast and efficient manner. Spark offers APIs for major programming languages like Java, Scala, and Python. In particular, Scala serves our case well since it makes the encoding of pattern matching and inductive definitions very easy. Using Scala has, for instance, allowed us to implement both the type inference and the type fusion algorithms in a rather straightforward manner starting from their respective formal specifications.

The type inference implementation extends the Json4s library [4] for parsing the input JSON documents. This library yields a specific Scala object for each JSON construct (array, record, string, etc.), and this object is used by our implementation to generate the corresponding type construct. The type fusion implementation follows a standard functional programming approach and does not need to be commented.

It is important to mention that the Spark API offers a feature for extracting a schema from a JSON document. However, this schema inference suffers from two main drawbacks. First, the inferred schemas do not contain regular expressions, which prevents one from concisely representing repeated types, while our type system uses the Kleene star to encode the repetition of types. Second, the Spark schema extraction is imprecise when it comes to dealing with arrays containing mixed content, such as, for instance, an array of the form:

[Num, Str, {l : Str}]

In such a case, the Spark API uses type coercion, yielding an array of type String only. In our case, we can exploit union types to generate a much more precise type:

[(Num + Str + {l : Str})∗]

For our experiments we used four datasets. The first two datasets are borrowed from an existing work [13] and correspond to data crawled from GitHub and from Twitter. The third dataset consists in a snapshot of Wikidata [6], a large repository of facts feeding the Wikipedia portal. The last dataset consists in a crawl of NYTimes articles using the NYTimes API [5]. A detailed description of each dataset is provided in the sequel.

GitHub.
This dataset corresponds to metadata generated upon pull requests issued by users willing to commit a new version of code. It comprises 1 million JSON objects sharing the same top-level schema and only varying in their lower-level schema. All objects of this dataset consist exclusively of records, sometimes nested, with a nesting depth never greater than four. Arrays are not used at all.

Twitter.
Our second dataset corresponds to metadata that are attached to the tweets shared by Twitter users. It comprises nearly 10 million records corresponding, in majority, to tweet entities. A tiny fraction of these records corresponds to a specific API call meant to delete tweets using their ids. This dataset is interesting for our experiments for many reasons. First, it uses both records and arrays of records, although the maximum level of nesting is 3. Second, it contains five different top-level schemas sharing common parts. Finally, it mixes two kinds of JSON records (tweets and deletes). This dataset is useful to assess the effectiveness of our typing approach when dealing with arrays.

Wikidata.
The largest dataset comprises 21 million records reaching a size of 75GB and corresponding to Wikipedia facts. These facts are structured following a fixed schema, but suffer from a poor design compared to the previous datasets. For instance, an important portion of Wikidata objects corresponds to claims issued by users. These user identifiers are directly encoded as keys, whereas a clean design would suggest encoding this information as the value of a specific key called id, for example. This dataset can be of interest to our experiments since several records reach a nesting level of 6.

NYTimes.
The last dataset we are considering here is probably the most interesting one; it comprises approximately 1.2 million records and reaches a size of 22GB. Its objects feature both nested records and arrays, and are nested up to 7 levels. Most of the fields in records are associated with text data, which explains the large size of this dataset compared to the previous ones. These records encode metadata about news articles, such as the headline, the most prominent keywords, the lead paragraph, as well as a snippet of the article itself. The interest of this dataset lies in the fact that the content of fields is not fixed and varies from one record to another. A quick examination of an excerpt of this dataset has revealed that the content of the headline field is associated, in some records, with subfields labeled main, content kicker, kicker, while in other records it is associated with subfields labeled main and print headlines. Another common pattern in this dataset is the use of Num and Str types for the same field.

In order to compare the results of our experiments using the four datasets, we decided to limit the size of every dataset to the first million records (the size of the smallest one). We also created, starting from each dataset, sub-datasets by restricting the original ones to, respectively, one thousand (1K), ten thousand (10K), and one hundred thousand (100K) records chosen in a random fashion. Table 1 reports the size of each of these sub-datasets.

              1K       10K      100K     1M
GitHub        14MB     137MB    1.3GB    14GB
Twitter       2.2MB    22MB     216MB    2.1GB
Wikidata      23MB     155MB    1.1GB    5.4GB
NYTimes       10MB     189MB    2GB      22GB

Table 1: (Sub-)dataset sizes.

6.2 Testing Scenario and Results
The main goal of our experiments is to assess the effectiveness of our approach and, in particular, to understand if it is able to return succinct yet precise fused types. To do so we report in Tables 2 to 5, for each dataset, the number of distinct types, the min, max, and average size of these types, as well as the size of the fused type. The notion of size of a type is standard, and corresponds to the size (number of nodes) of its Abstract Syntax Tree. For fairness, one can consider the average size as a baseline wrt which we
compare the size of the fused type. This helps us judge the effectiveness of our fusion at collapsing common parts of the input types.

From Tables 2, 3, and 4, it is easy to observe that our primary goal of succinctness is achieved for the GitHub and the Twitter datasets. Indeed, the ratio between the size of the fused type and the average size of the input types is not bigger than 1.4 for GitHub, whereas it is bounded by 4 for Twitter, which are relatively good factors. These results are not surprising: GitHub objects are homogeneous. Twitter has a more varying structure and, in addition, it mixes two different kinds of objects, deletes and tweets, as outlined in the description of this dataset. This explains the slight difference in terms of compaction wrt GitHub.

As expected, the results for Wikidata are worse than the results for the previous datasets, due to the particularity of this dataset concerning the encoding of user-ids as keys. This has an impact on our fusion technique, which relies on keys to merge the underlying records. Still, our fusion algorithm manages to collapse the common parts of the input types, as testified by the fact that the size of the fused types is smaller than the sum of the input types. Finally, the results for the NYTimes dataset, which features many irregularities, are promising and even better than the rest. This can be explained by the fact that the fields in the first level are fixed while the lower-level fields may vary. This does not happen in the previous datasets, where the variations occur on the first level.

           Inferred types size                     Fused
        # types    min.   max.    avg.        type size
1K      29         147    305     233         321
10K     66         147    305     239         322
100K    261        147    305     246         330
1M      3,043      147    319     257         354

Table 2: Results for GitHub.

           Inferred types size                     Fused
        # types    min.   max.    avg.        type size
1K      167        7      218     74          221
10K     677        7      276     75          273
100K    2,320      7      308     75          277
1M      8,117      7      390     77          299

Table 3: Results for Twitter.

           Inferred types size                     Fused
        # types    min.   max.    avg.        type size
1K      555        299    887     597.25      88
10K     2,891      6      943     640         331
100K    15,959     6      997     755         481
1M      312,458    6      1,046   674         760

Table 5: Results for NYTimes.

Typing execution times are reported in Table 6. As can be observed, processing the Wikidata dataset is more time-consuming than processing the two other datasets. This is explained, once again, by the nature of the Wikidata dataset. Observe also that the processing time of GitHub is larger than that of Twitter due to the size of the former dataset, which is larger than the latter.

            1K     10K     100K     1M
GitHub      1s     4s      32s      297s
Twitter     0s     1s      7s       73s
Wikidata    7s     15s     121s     925s

Table 6: Typing execution times.

6.3 Scalability
To assess the scalability of our approach, we have deployed the typing and the fusion implementations on our cluster. To exploit the full capacity of the cluster in terms of number of cores, we set the number of cores to 120, that is, 20 cores per node. We also assign to our job 300GB of main memory, hence leaving 72GB for the task manager and other runtime monitoring processes. We used the NYTimes full dataset (22GB) stored on HDFS. Because our approach requires two steps (type inference and type fusion), we adopted a strategy where the results of the type inference step are persisted into main memory to be directly available to the fusion step. We ran the experiments on datasets of varying size obtained by restricting the full one to the first fifty, two hundred fifty, and five hundred thousand records, respectively. The results of these experiments are reported in Table 7 together with some statistics on these datasets (number of records and cardinality of the distinct types). It can be observed that execution time increases linearly with the dataset size.
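As a rough illustration of how the two phases can be wired together on Spark, the sketch below types every line of an input file, persists the intermediate types in main memory, and reduces them with the fusion operator. Here parseJson, inferType, and fuse stand for the Json4s-based parsing, the Map-phase typing, and the fusion function of the implementation; the wiring and all names are our own reconstruction under these assumptions, not the authors' code.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object SchemaInferenceJob {
  type JValue = AnyRef   // stand-in for the parsed JSON representation (Json4s in the paper)
  type JType  = AnyRef   // stand-in for the inferred types of Figure 3

  def parseJson(line: String): JValue = ???   // parsing of one JSON document
  def inferType(v: JValue): JType     = ???   // Map phase (Section 5.1)
  def fuse(t: JType, u: JType): JType = ???   // fusion (Section 5.2), commutative and associative

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("json-schema-inference"))

    // Map phase: one inferred type per input JSON value.
    val types = sc.textFile(args(0)).map(line => inferType(parseJson(line)))

    // Keep the inferred types in main memory so that the fusion step reads them directly.
    types.persist(StorageLevel.MEMORY_ONLY)

    // Reduce phase: commutativity and associativity of fuse allow a parallel reduction.
    val schema = types.reduce(fuse)

    println(schema)
    sc.stop()
  }
}
```

A partition-local variant in the spirit of the strategy discussed next could, for instance, first fuse the types within each partition (e.g., with mapPartitions) and only then fuse the handful of per-partition schemas.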
To overcome this problem, we considered a strategy based on partitioning the input data that would force Spark to take full advantage of the cluster. In order to avoid the overhead of data shuffling, the ideal solution would be to force computation to be local until the end of the processing. Because Spark 1.6 does not explicitly allow such an option, we had to opt for a manual strategy where each partition of data is processed in isolation, and each of the inferred schemas is finally fused with the others (this is a fast operation as each schema to fuse has a very small size). The purpose is to simulate the realistic situation where Spark processes data exclusively locally, thus avoiding the overhead of synchronization. The times for processing each partition are reported in Table 8. The average time is 2.85 minutes, which is a rather reasonable time for processing a dataset of 22 GB.

               # objects    # types    time
partition 1    284,943      67,632     2.4 min
partition 2    300,000      83,226     3.8 min
partition 3    300,000      89,929     1.9 min
partition 4    300,000      84,333     3.3 min

Table 8: Partition-based processing of NYTimes.

Note that this simple yet effective optimization is possible thanks to the associativity of our fusion process.

7. CONCLUSIONS AND FUTURE WORK
The approach described in this paper is a first step towards the definition of a schema-based mechanism for exploring massive JSON datasets. This issue is of great importance due to the overwhelming quantity of JSON data manipulated on the web and due to the flexibility offered by the systems managing these data.

The main idea of our approach is to infer schemas for the input datasets in order to get insights about the structure of the underlying data; these schemas are succinct yet precise, and faithfully capture the structure of the input data. To this end, we started by identifying a schema language with the operators needed to ensure succinctness and precision of our inferred schemas. We then proposed a fusion mechanism able to detect and collapse common parts of the input types. An experimental evaluation on several datasets validated our claims and showed that our type fusion approach actually achieves the goals of succinctness, precision, and efficiency.

Another benefit of our approach is its ability to perform type inference in an incremental fashion. This is possible because the core of our technique, fusion, is incremental by essence. One possible and interesting application would be to process a subset of a large dataset to get a first insight on the structure of the data before deciding whether to refine this partial schema by processing additional data.

In the near future we plan to enrich schemas with statistical and provenance information about the input data. Furthermore, we want to improve the precision of the inference process for arrays and study the relationship between precision and efficiency.

8. ACKNOWLEDGMENTS
Houssem Ben Lahmar has been partially supported by the project D03 of SFB/Transregio 161.³

³ http://www.trr161.de/interfak/forschergruppen/sfbtrr161

9. REFERENCES
[1] The JSON Query Language. http://www.jsoniq.org.
[2] JSON schema definition language. http://jsoniq.org/docs/JSound/html-single/.
[3] JSON Schema language. http://json-schema.org.
[4] Json4s library. http://json4s.org.
[5] NYTimes API. https://developer.nytimes.com/.
[6] Wikidata. https://dumps.wikimedia.org/wikidatawiki/entities/.
[7] Apache Spark. Technical report, 2016. http://spark.apache.org.
[8] V. Benzaken, G. Castagna, D. Colazzo, and K. Nguyên. Type-based XML projection. VLDB '06, pages 271–282, 2006.
[9] K. S. Beyer, V. Ercegovac, R. Gemulla, A. Balmin, M. Y. Eltabakh, C. Kanne, F. Özcan, and E. J. Shekita. Jaql: A scripting language for large scale semistructured data analysis. PVLDB, 4(12):1272–1283, 2011.
[10] T. Bray. The JavaScript Object Notation (JSON) data interchange format, 2014.
[11] S. Cebiric, F. Goasdoué, and I. Manolescu. Query-oriented summarization of RDF graphs. PVLDB, 8(12):2012–2015, 2015.
[12] D. Colazzo, G. Ghelli, and C. Sartiani. Typing massive JSON datasets. In XLDI '12, Affiliated with ICFP, 2012.
[13] M. DiScala and D. J. Abadi. Automatic generation of normalized relational schemas from nested key-value data. SIGMOD '16, pages 295–310, 2016.
[14] D. D. Freydenberger and T. Kötzing. Fast learning of restricted regular expressions and DTDs. Theory Comput. Syst., 57(4):1114–1158, 2015.
[15] Z. H. Liu, B. Hammerschmidt, and D. McMahon. JSON data management: Supporting schema-less development in RDBMS. SIGMOD '14, pages 1247–1258, 2014.
[16] J. McHugh and J. Widom. Query optimization for XML. VLDB '99, pages 315–326. Morgan Kaufmann Publishers Inc., 1999.
[17] M. Murata, D. Lee, M. Mani, and K. Kawaguchi. Taxonomy of XML schema languages using formal language theory. ACM Trans. Internet Technol., 5(4):660–704, Nov. 2005.
[18] S. Nestorov, S. Abiteboul, and R. Motwani. Inferring structure in semistructured data. SIGMOD Record, 26(4):39–43, 1997.
[19] S. Nestorov, S. Abiteboul, and R. Motwani. Extracting schema from semistructured data. In L. M. Haas and A. Tiwary, editors, SIGMOD 1998, Proceedings ACM SIGMOD International Conference on Management of Data, June 2-4, 1998, Seattle, Washington, USA, pages 295–306. ACM Press, 1998.
[20] F. Pezoa, J. L. Reutter, F. Suarez, M. Ugarte, and D. Vrgoč. Foundations of JSON Schema. WWW '16, pages 263–273, 2016.
[21] S. Scherzinger, E. C. de Almeida, T. Cerqueus, L. B. de Almeida, and P. Holanda. Finding and fixing type
mismatches in the evolution of object-NoSQL mappings. In T. Palpanas and K. Stefanidis, editors, Proceedings of the Workshops of the EDBT/ICDT 2016, volume 1558 of CEUR Workshop Proceedings. CEUR-WS.org, 2016.
[22] L. Wang, S. Zhang, J. Shi, L. Jiao, O. Hassanzadeh, J. Zou, and C. Wang. Schema management for document stores. Proc. VLDB Endow., 8(9):922–933, May 2015.